# Distributed Code Jam 2016 R2

15 to the world finals and at the moment the top 12 solved all the problems. To get through required a good time and only failing one of the larges, and not the one worth the most points.

I think I could have pretty easily gotten a top 200 placing, with the 2 easiest larges and the other two smalls.  None of the problems seem ‘too’ difficult, but writing the correct solution to 3 larges in the time available is probably a stretch for me.

Q1) Given some inefficient code, write equivalent but efficient code to solve the problem.

A1) Like last time the detail is of course in the code given.  This time it was a bit more deceptive.  It appeared to be calculating as a sum the all pair product between a set A and a set B, but subtly the central server doesn’t do any calculating, so its actually the all pair product where the pairs don’t sum to 0 mod 20.

The trick here is to realise the all pair product is sum of all elements in set A times the sum of all elements in set B.  Removing the pairs which sum to 0 mod 20 is a bit more tricky, but its the product of the sum of the 0 mod 20 in set A and set B plus the product of the sum of the items which are 1 mod 20 in set A and the items which are 19 mod 20 in set B, and so on.

To simplify the code slightly rather than calculating the total and the removing, you can just calculate the sum of items that are x mod 20 in set A and y mod 20 in set B and sum all pair product where x+y mod 20 != 0

All that remains is to solve the problem in a distributed scenario where there are 100 million elements in each of set A and set B.  Summing things is easily distributed, so each node creates to sum tables, one for each of set A and set B with an entry for each mod 20.  Then it processes its chunk of the two inputs calculating the mod and summing it into the correct chunk.  Once done, send both tables to the central server, which sums all the tables together and then performs the final pair product step.

One thing is that the answer can easily be too big for 64 bit, so the problem actually wants it modulo 1 billion and 7.  As with many problems with this requirement, its important to remember to apply the modulo after every calculation.  And even though the modulo and all individual inputs are valid for 32 bit, calculating products will have intermediary values which need 64 bit precision.

Q2) Determine the longest prefix of a sequence of brackets which can be extended in to a balanced sequence.  Return -1 if the entire sequence is already a balanced sequence.

A2) So the question is worded a bit more vaguely than above, but if you’ve had some exposure to bracket systems you’ll know that the description is just equivalent to any balanced sequence of brackets.  At first it might seem this problem will be difficult to do distributed because each segment’s validity clearly depends on the one before, but one of the details of a balanced sequence of brackets is that any sequence of brackets can be made balanced, so long as the running sum of open (+1) and closed (-1) never goes negative.  Therefore each segment just depends on the running sum of the previous segments.

So each node calculates its total for the local running sum, sends them to the central server which accumulates those running sums and sends back the correct starting total for each node, to that node.  Now each node can check whether the true running sum goes negative at any point.  If it does, it sends that position to central server, otherwise it sends -1.  The central server then returns the first non-negative result.  Otherwise it checks if the grand total is 0, if so it returns -1, otherwise it returns the length of the full data set.  (This last conditional I think is probably the easiest thing to miss in this problem…)

Since the data is only 1 character per value, memory use isn’t a high risk, and the cost per read of value is low enough that you could probably read each value assigned to the node twice without running out of time if you don’t want to read in to a local array.  The usual gotchas around splitting data amongst the nodes apply.

Q3) Given a game where you can move left or right 1 unit or stay still in any given turn, and the rest of the game map moves down one unit after your move.  Determine the maximum number of points you can get given any choice of starting locations.  Each non-wall containing cell is worth between 0 and 9 points which you earn just by entering it.

A3)  The small input can be easily done on a single node, and the solution is a dynamic program on the maximum number of points you can get in total once you arrive at position x.  From each row from bottom to top you try each non-wall position, then there are 3 possibilities, if the position in front isn’t a wall max its current value with the sum of the local current value and the value of that cell.  Then if the right and right forward diagonal aren’t walls, similar with the right diagonal cell, but also add in the value of the right cell.  Similarly with left diagonal.  This is seeded by the value of each non-wall position in the bottom row, and the final result is the maximum value in the row above the top of the input.  This DP is O(1) per cell, so can be completed very quickly in the small input scenario where the maximum size is 1000 by 1000.  The large input of 30000 by 30000 would be almost processed in time by a single node, if it wasn’t for the memory limits and the time to read the raw input.

So the question becomes, how to break the problem up to pass the large input across multiple servers.  Horizontal stripes are a bad idea, because the mathematical operation is a max of sums and it mixes values from a lot of locations so it doesn’t distribute well trying to chain it with results that arrive after you’ve started.  So the solution is vertical stripes.  However the edges of each row of a stripe are dependent on one cell to the right and left of the stripe of  the row below.  If each server was to communicate its edges with every row, that would be 30 thousand messages, which given the estimated 5ms latency per message is far far far too long to run in time.

In order to send less messages each node needs to do more work.  Because the game is limited to one move left or right per row, if we want to only send a message every 300 rows (which brings in a total of half a second expected latency instead of 150 seconds) we need to start with an extra 600 width on our base section, then 598, 596…   This doubles the number of values which has to be read by each server if they would usually be dealing with a 300 wide stripe (and doubles the number of DP calculations, but the reading of the input dominates).

In order to continue after we’ve done 300 rows, we need to share the correct values for the full set of 300 neighbours on each side – which we can get from each of the neighbouring servers if we’ve set it up right…  Once we’ve got that information we can continue for another 300 rows.

So, does this solution scale well enough?  Estimated latency waiting for network is 500ms.  Time to ready the values for the stripe and half of each neighbour is 30000*600*0.05us = 0.9 seconds.  Actual computation time is much lower.  So 5s should be plenty of time.  Memory usage is 18MB minimum, but quadruple that if you use 16bit chars and a resizing array without calling reserve.  Still fits.  In fact it fits well enough you might as well make your life simpler and just use a full 3 widths allocation even though you’ll only use half of each of the outside areas.

Each message sent is 1200 bytes, you send up to 100 messages to 2 servers, this isn’t even a whole megabyte…  And the 100 messages is well below the 1000 limit.

Gotchas:

1. Using your normal logic for distributing the columns to servers can result in server with much less than 300 width.  If your code mistakenly decides to send in height chunks equal to the width allocated to the server, you can easily become too message spammy…   If you don’t, you need to start requesting results from more than one server over, and again you become too spammy. You want to explicitly allocate exactly 300 main columns to each server to make this simpler.  If you don’t use all the servers, that’s okay, it will still easily run in time.  More generically (if you don’t like writing your code to the specific problem constraints…) you should divide it in to equal chunks of width depending on the longest of the two dimensions.
2. One server has a different width, so its neighbouring server gets complicated…  Solution is to pretend that the total width is actually exactly a multiple of 300, and fill the extra columns with walls.  Makes the code for reading values in to the array slightly more difficult, but makes everything else much simpler…

Q4) Determine the cheapest way to get from a to b if there is a petrol station every km and you use one litre per km, and have a maximum fuel tank size T.  Each petrol station has its own price per litre.

A4) So I struggled for a while just to get the logic right for the small input, but eventually I realised its actually quite simple.  For each km, the cost you paid for that km, is the cheapest of the previous T petrol stations.  So if you keep a running minimum you just add the minimum value for each node as you go.

As I mentioned for the first problem, the sum operation distributes well.  A running min on the other hand…

So, the solution is all about calculating a running min.  If T is small, this is easy, just do it locally.  More specifically, if T is less than the width you allocate to each node, the problem might as well be solved locally, each node potentially doubles the number of values it has to read in, but that is still easily managed against the time and memory limits with some care.  Each node calculates its sum and sends it to the central server which creates the grand total.

However there is no such limit on T, it could be very large…  However, if T is larger than the width you allocate to each machine, you don’t have to do a running min on that machine, its just a simple min of the values seen so far.  Any value added is never removed.  However you also need to consider the minimum of some of the values allocated to other nodes, and that you do have to remove values from.  But before we get into that detail.. Consider if T is larger than double the width.  In that case there is an entire server’s worth of values which will always be in the running min for any given server other than the first.  Here we find the opportunity to calculate something distributed which will help the other nodes.

So the first step is for each node to find its minimum price of its allocated section, and broadcast that to every node to the right of it.

So, now for each node the running min consists of 3 (or 4) values of which we take the minimum.

1. The minimum seen so far in this chunk that we are calculating the sum for.
2. The minimum of the previous servers which are always completely covered by the fuel tank size T.
3. The minimum of the previous server which is currently completely covered (but won’t be later once we’ve advanced more).
4. The minimum of the partial server section needed to fill out T.

The first two of these are easy, the third is easy if its present.  The fourth presents a bit of a challenge.  I propose that the simplest way is for the local server to read the values allocated to that partial server, in reverse order from the end of that server allocation.  As it finds a new local minimum, it records it and its position.  Then as you advance pointer you check if the current far minimum is still valid, if not, pop it off the stack and use the one underneath.  Worst case this stack contains the entire contents of a servers allocation, at 8 bytes a pop, 4 bytes for the minimum (since they all fit under 1 billion) and 4 bytes for the index.

So, does it scale well enough.  Well the worst case tank size is when it causes some servers to create the stack for almost one entire server, then again for another server because advancing the width just drops over into the next server, and in order to calculate that minimum you have to process the entire server’s width.  The processing time itself is cheap, the memory allocations will fit so long as you reuse your reserved size stack when switching servers at the boundary… (Should be under 80MB if you do it right.) Network latency is not significant…  The big concern is the data reading time.  5 million positions allocated per server, reading time 0.75 seconds just for its own.  Triple that and its getting close to half of the allowed time limit just reading the data.

So, it might be good enough… but can we do better?  The answer is kind of yes.  At the beginning we calculated a single min per server and sent that around.  If instead we calculate 100 mins per server, we don’t make calculating the global minimum for the trivially covered section much slower, but it reduces the size of the worst case stack by a factor of 100.  On the other hand it increases the worst case number of stacks we have to create from 2 to 101, but because those stacks are all 1/100th the size, we just reduced the overhead from 2 extra servers worth of reading to 1.01 extra servers worth.  This saves you almost 0.75 seconds of reading time in the worst case.  The reduction to the worst case stack size also makes it much much easier to stay under the memory limit.  It increases the network traffic size by a factor of 100, but its still under 100KB sent per node.

Gotchas:

1. I said it was easy for the case where T is small – but for T around the size of 1 width its actually a bit tricky.  Doing a running min the natural data structure is a multiset (an ordered set which allows repetition), but a multiset with 5 million entries can have an an unexpectedly large memory cost, well beyond the 40MB you might expect.  Without adequate knowledge of the implementation details of the multiset included in your programming language, you definitely run risk of running out of ram…  This can be solved by the same extension proposed to reduce the read worst size in the very large T case.  Using the locally calculated 100 mins, the size of T where you need to do a true running min is reduced to only 50thousand elements, which a multiset will easily fit in to memory for.