# Distributed Code Jam 2016 Round 1

So this year, the structure is a bit different 2 rounds before the grand final for distributed and more people eligible to compete in DCJ.

As usual I’ve been busy, so Round 2 is almost about to happen, I figured it was time to write up my analysis of Round 1.

Unlike round 2 of the main code jam, I think I would have had a chance of advancing to round 2 of DCJ this year if I was competing.  I usually say there is practically no chance of me writing up 4 solutions in 3 hours, but this round the questions seemed pretty easy (except for the large of the last one), so I think there would have been a real chance of me getting a top 200 placing.

Q1) Given some inefficient source code, write an equivalent, but efficient version.

A1) So the detail is of course in the provided source code.  It is of course a bit convoluted, but the net result is its calculating the largest difference between any two values in the input.

Of course the best way to do that is to calculate the global maximum and global minimum and take their difference.  The large input there is too much data to do this all in one node (of course), but if you break the input in to approximately equal chunks per server, each server calculates min/max – sends to central server which calculates the min of mins and max of maxs.  Then the difference of those two is the largest difference and that central server can print it out.

I suspect the biggest chance of failing this question is in breaking up the input incorrectly.  Its a very common task in DCJ, you always want to keep the code around to do this, you need to handle cases where there is less input then there are nodes, or if you try to allocate equal size to each node that some nodes won’t get any.  If you want to get tricky and try dividing up the input by modulo rather than blocks (which can seem easier) remember to write your loop using += modulus rather than checking every number against the mod.  1 billion divisions is going to time out…

Another risky move would be to read all the values you need into memory rather than just calculating the min/max on the stream of values as you read them in.  This should only take 80MB of ram per node in the worst case, but if you are using a doubling resizeable data structure like vector, you will be very likely to hit 128MB instead, maybe even higher…  Reserve the required space to avoid this.

Q2) Determine who wins a giant rock paper scissors competition if everyone always plays a specific move, and in case of draws the person with lower id wins.

A2) The large input size is only 2^28 unlike the 1 billion for the first problem, but its still easily large enough for you to time out if you tried to read all of the values on a single node.  Breaking the input up in to approximately equal chunks across all servers is not a good strategy this time, a much better option is to break it up into a number of exactly equal chunks which is the largest power of 2 which is less than the number of servers.  This means each server is running its own isolated competition with a single winner.  Each single winner can then be passed to the central server and it can continue from there.

This one doesn’t seem very risky so long as you break your input up by powers of two.  Only 4 million characters, you can read them all in to memory, and even not bother reserving, or reusing vectors as you run the contest (even if you happen to be using 16bit characters…).  Run time should be pretty fast too.

Q3) Determine the minimum number of moves to restructure a bunch of piles of crates to have the specified almost uniform height across all piles. (All equal, or one higher for some of the earlier stacks if the total isn’t exactly dividable amongst the number of stacks.

A3) Definite step up in difficulty here.  The contest analysis describes one approach, I’m going to describe a slightly different one which I think is a bit simpler logically, although its a bit more difficult to code to be fast enough…

First phase we need to know how high each stack should be.  To know this we need to know how many crates there are in total.  So every node takes a chunk and sums it up and sends to central server which sums the sums, and then rebroadcasts that grand total back to everyone…

The core trick the problem is that because the crates can only be moved one at a time to a neighbouring stack, the problem can be solved left to right one stack at a time.  First stack knows how high it should be, how high it actually is.  If its too high, k moves to move the excess to the next.  If its too low, it needs to steal from its neighbours to the right, so walk right stealing until you have enough.  Steal from the closest first, and as you get further away you each steal costs multiple moves to move through all the intervening stacks.

For the distributed version each node needs to solve in isolation to be fast enough, so at some point you are going to dump a bunch of crates on to the next server, or you are going to need a bunch of crates from the next server.  For the purposes of stealing once you get the edge you just assume the other server will have moved them in to your last stack and you can count the cost of moving them from that stack to the right locations.

So, since each server will need to know the flow at the serve edge to calculate the correct value, the central server might as well calculate that while its calculating the grand total.  Since all it needs to work that out is know what the desired height total for each server is, and the total.  Since it just worked out the grand total, the desired height total for each server isn’t much harder, with the one corner case of the server where the desired height changes.  Then going from left to right comparing the sum to the desired total it can work out the flow to the next server, and assuming the flow and next sum and compared to the next desired total it can determine the next flow after that….

It then shares the flow to server before and server after to every server.  The initial and final flows are of course 0.

Each server node then can process left to right.  First deal with the flow.  Positive flow in to this node you just add to the corresponding edge node.  Negative flow you start the standard stealing process to calculate.  Remembering to track the cost.

Once the flows are fixed you can just solve the problem locally given the desired height calculated from the grand total.

Gotchas to watch out for.

1. Stealing from neighbours can cause you to run too slow if you try to check the same set of neighbours over and over.  You need to keep track of where you have already picked clean, and start one right of that any time you need to steal to ensure that the local calculation part is linear in the size of the section assigned.
2. Even though each stack size is only up to 1 billion, the flows calculated can be in the hundreds of millions of billions.  Therefore you must use 64 bit entries.  Given each node is dealing with up to 10 million entries, this means you must reserve to avoid the risk of hitting 128MB.
3. Because of the large flows that can travel across nodes, when stealing the cost calculated from a single steal can overflow 64 bits.  Either do the product with 128bit types, or (more sensibly) ensure that you mod the number of crates being moved before you multiply by the distance moved.
4. Mod everywhere – the product I mentioned above is the biggest individual risk, but there are plenty of other ways to overflow if you don’t remember to apply the mod during the process and not just at the end.

Q4) Determine the smallest positive integer in a set which is unique.

A4) The small can be run on a single node, and is easy to write.  Just read the values in, sort them, return the first which is not equal to one of its neighbours.  Or zero if you get to the end of the array.

The large is only 35 Million, which if it wasn’t for the memory limit is almost able to be processed in the 4 second time limit…  Well, maybe it would need 7-8 seconds…

The contest analysis solution is much safer, rather than trying to do a distributed merge sort, break the problem in to chunks and then combine the results.  Unlike the other problems where the chunks are segments of the input stream, (or stripes if you like for the first problem) the appropriate chunks depends on the input.  In order for each section of input to be processed independently, we need to be sure that all duplicates of any given number are processed by the same node.  This can be achieved by a hash function.  Each node processes a segment, and after locally separating the dupes from the uniques, forwards values to other servers based on their hash.  Each server then receives a two sets of values from other servers, the uniques and the dupes.  Since it now knows all copies of a specific value are together, it can make a call as to the lowest unique which has a specific hash assigned to this server.  It sends these locally lowest definitely globally unique values to the central server, which can then calculate the global lowest.

With the relatively small input size, if you are using all 100 nodes it would seem that there is no risk of running out of ram following the general strategy described above.  But if you don’t reduce the input sent to specific servers by hash down to the two sets and instead send all the values, a single server can easily be overwhelmed by the input case where all the values are the same.  If your hash is terrible (ie mod 100), similar problems can be had.  But if your hash is decent, and you do correctly filter the data down before sending it on, it should be pretty safe.