The Missing Docs .Net – Finding the holes in the Net

Ceres

Ceres is a newly released chess engine which has a re-implementation of MCTS search over the lc0 backends. It shows very strong search performance characteristics which will be interesting to learn from. The specific details of some of those performance characteristics would likely be difficult to copy into lc0 without also requiring a full rewrite. So there has been some talk of possibly migrating to Ceres as a replacement for the lc0 engine in general. This post isn’t about that though. Instead I wanted a place to capture the ‘details’ of difference as I read through the Ceres codebase.

Move selection

Ceres has a couple of novelties in move selection.

It has an option to take ‘best Q’ in preference to ‘best N’ so long as the visit count is still a decent fraction of best N. Specifically this fraction depends on the search tree size and the delta in Q. The values look empirical and are step wise. Maybe a curve fit of some kind could make this a more practically tune-able feature. Notably even for the largest Q deltas in a large tree the N required for a node to be selected is 30% of best N.
Due to supporting a ‘best Q’ based approach, it has a function for adding a ‘boost’ to the Q value used, based on the moves left head output. Curiously it’s based on the moves left estimate of the ‘best Q’ without filtering for N, rather than the moves left estimate of the ‘best N’ which should be more accurate.

Search

By far the largest section of novelties is going to be the search itself… I’m not covering the micro-optimizations here, of which there are plenty. (Hand rolled AVX2 parallelization of the UCT select inner loop for instance.)

Try to gather x nodes per tree walk. Where lc0 performs a separate tree walk for each node to be gathered and performs out of order updates after each such node, Ceres gathers multiple nodes in a single tree walk. This is something I’ve previously considered but was concerned that the interaction with out of order would cause too much of a problem. The basic performance benefit of not walking the same nodes multiple times to gather the batch has already been partially offset in lc0 via some smart local caching, which Ceres doesn’t need as a result. Ceres splits the gathered set and applies the out of orders as a bunch. This risks a problem that lc0 has had in the past where we allowed multiple hits on a terminal to apply at once without considering the actual back prop of those terminals. We disabled multiple visits because the speed in rare circumstances didn’t offset the risk. With Ceres however it is core to gathering speed in all scenarios, so it is likely that the overall speed improvement can offset the risk. Additionally Ceres does some other things that probably helps here… but maybe more importantly this risk is linked to batch size, not collision limits like we used to have it set up in lc0.
Adaptive parallel search, if the search tree has a ‘nice’ fork point – use a second thread for the other side of the fork. This is only doable because of having a target of x nodes to gather at once. These independent parallel searchers could effectively share a single lock since they don’t work on any shared data, except for transposition handling getting in the way (?Apparently?)
Adaptive batch sizes. When the tree is small Ceres tries to gather batches far smaller than optimal for gpu utilization. This probably offsets some of the risk of getting multiple hits to a terminal without back prop, which is especially dangerous when the size of the batch is noticeable as a proportion of the total tree.
Even with adaptive batch sizes, split them smaller for the parallel gather and use heuristics to decide whether to do the rest. This seems like further offsetting the out of order risk, but keeping the potential to gain search performance from the adaptive parallel search.
Two gatherers that use semi-independent n in flight calculations (adding in a fraction of the others), with separate cpuct values to reduce the probability that they gather nodes that are similar. I’m unclear why this can really work in practice (although the fractional addition of the other helps), but assuming it does it is pretty nice.
Fractional relative virtual loss. This looks like a critical component at first glance. One of the issues facing trying to gather ‘x nodes at once’ in lc0’s model is that we don’t use virtual loss. Instead we use ‘virtual visit’, which leaves the node’s current Q value in place and just increments N. True virtual loss causes selection of really bad nodes too easily, but virtual visits can require quite a large number of visits in order to change selections. Thus lc0 often has to calculate a ‘multivisit’ collision in order to avoid just colliding on the same thing over and over. Ceres uses an alternative in between, using a parent relative virtual loss that is ‘small’ as opposed to the absolute virtual loss (unless the position is absolutely losing already, in which case it’s worse). This makes it much easier to gather batches, but the batches still aren’t as bad as pure virtual loss.
Transposition handling. Seems like work in progress still, but the default mode does at least return the transposition node Q rather than transposition node V, which is a slight step up over simple caching.
Adaptive batch evaluation. For small search trees it will not let there be any pending evaluations before starting next batch gathering. For lc0 this would be like starting one search thread and then starting another once the tree was big enough. Further offsets the risk during the small tree.
Use of MLH in search is currently disabled – curious what led to that decision…

Backend Handling

Where lc0 uses 2 search threads which block on the backend, to try and keep a single backend busy, Ceres instead allocates 2 separate backend instances, potentially asking the gpu to do more than one piece of work at once. I wouldn’t immediately expect this to be a performance gain, but if it is it should be easy to simulate in lc0 via creating a roundrobin backend with the same gpu listed twice.

And I think that about covers it for an initial pass through the core search components of the code base. It is a large amount of new code so I am sure I’ve missed several things. There are a lot of features I’ve not mentioned, that are currently disabled.

A new post

So, it’s been a long time since I wrote anything on this blog obviously… I’ve been distracted by some new hobbies. Specifically I’ve been helping to run the training for the leela chess zero project. www.lczero.org

I still do some competitive coding, but not in any public external competitions. So I probably won’t be writing up any summaries any time soon. But maybe I’ll write up some stuff on some other topics…

TCO17 R1B

Fewer than 500 people turned up for this round, so of course positive scores advanced, of which there were 312. Almost 100 of those solved the 1000pt question, with the 250pt question seeming to have a higher failure rate amongst the top competitors.

Q1) Given two amounts of two different types, and two rates of consumption and the ability to convert type 1 in to type 2 at rate of 2:1, determine the maximum time before one or the other type is exhausted.

A1) So there are two scenarios. First determine the baseline time of exhaustion by dividing the amount by rate of consumption for each type. If type 1 runs out first, there is nothing to be done, return that time. Otherwise the aim is to find an exchange level which has type 1 and type 2 run out at the same time, since each increment we convert will lengthen the shortest and shorten the longest until that point.

Thus we need to solve (A1-x)/C1=(A2+x/2)/C2 to determine x, and then recalculate either side. x =(C2A1-C1A2)/(C2+C1/2). Note that x looks like it could be negative, but that is the same case as type 1 running out first.

Probable cause of failure is failing to account correctly for the input range. C2*A1 will overflow 32 bit int if you haven’t converted to floating point. C2A1-C1A2 being calculated with small precision float could be an issue as well, but I think using a 64 bit float might be sufficient, definitely fine with 80 bit float. Obviously it could all be calculated using BigInteger fractions, but I’m pretty sure that is excessive.
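As a rough sketch of the calculation above, here it is done with exact fractions so that overflow and precision can't bite. The function name `max_time` and its argument order are my own for illustration, not taken from the problem statement, and it assumes the conversion amount x can be any real value.

```python
from fractions import Fraction


def max_time(a1, a2, c1, c2):
    """Longest time before either type is exhausted.

    a1, a2: starting amounts; c1, c2: consumption rates (hypothetical names).
    Type 1 can be converted into type 2 at a rate of 2:1.
    """
    t1 = Fraction(a1, c1)
    t2 = Fraction(a2, c2)
    if t1 <= t2:
        # Type 1 runs out first (or at the same time): converting cannot help.
        return t1
    # Solve (a1 - x) / c1 == (a2 + x / 2) / c2 for the amount x to convert.
    x = Fraction(c2 * a1 - c1 * a2) / (c2 + Fraction(c1, 2))
    return (a1 - x) / c1
```

For example, with 10 units of type 1, none of type 2 and both rates equal to 1, the best move is to convert 20/3 units, which lasts 10/3 time units on both sides.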

Q2) Determine a set of positive integers such that the number of distinct sums which can be constructed from them (using each number at most once) is exactly k. Maximum number of values in the set is 20 and k might be up to 1 million

A2) k=(2^N)-1 is an easy first scenario, just use the powers of two with exponents between 0 and N-1 inclusive.

As an extension of this observation, it can be seen that given an existing set S, you can create a new set S2 which has double + 1 as many distinct output values, by adding a new value which is one more than the sum of the existing values. If instead we add a new value which is equal to the sum, the number of distinct outputs doubles. Now if we can work out a way to go from one set to another such that only 1 new value is added it would seem we are done.

If we can assume that the existing set of outputs when sorted forms the range 1 to x, then by adding the value 1, the new output is 1 to x+1, and we’ve added exactly one new value. Helpfully the proposed doubling scheme above when given the range 1 to x creates 1 to 2x.

Thus we have increment and double operators, which lets us create any number in a logarithmic number of steps. However, with a target size of up to 1 million, it’s easy to construct a scenario where just double and increment take 38 values. But we’re limited to 20.

So we need to be more greedy. Note that we can actually extend our logic above to say that given a set of size x we can create any new size between x+1 and 2x+1, just by adjusting the new value added. Or in reverse, given a current size x, we create it from any size between x-1 and floor(x/2). So rather than building up to the target number, we could instead tear it down. If the target is odd, choose a value for your output set which is half rounding up and solve for half rounding down, if the target is even choose the value equal to half and solve for half. Repeat until the target is 0. Since the target at least halves each step and 2^20 is more than 1 million, 20 values is sufficient.
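The tear-down can be sketched in a few lines. The names `build_values` and `count_distinct_sums` are just mine for illustration, and the second function is only a brute-force checker. Note that this construction can repeat a value (k = 2 gives 1, 1), so I'm assuming the problem accepts repeats.

```python
def build_values(k):
    """Tear the target k down by halving, recording the value chosen each step."""
    values = []
    while k > 0:
        values.append((k + 1) // 2)  # half rounding up (exactly half when k is even)
        k //= 2                      # solve for half rounding down next
    return values[::-1]              # smallest first


def count_distinct_sums(values):
    """Brute-force count of distinct sums over non-empty subsets (checker only)."""
    sums = {0}
    for v in values:
        sums |= {s + v for s in sums}
    return len(sums) - 1             # drop the empty subset
```

For k = 6 this gives the values 1, 2, 3, whose subset sums are exactly 1 through 6.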

Q3) Given a tree with weighted nodes, determine the set of all sub-tree weights, then determine the sum of x raised to the power of each member of that set. Return the answer modulo 1 billion and 7. Weights of each node may be large, x itself may be large, but there are at most 50 nodes.

A3) Since the weight of each node can be large, and the tree could be very flat, the number of distinct subtree weights could be huge. So enumerating them all is out of the question. So it would appear that the problem needs to be built up from the leaves.

As a simple starting point, consider a node with one child and an unknown number of sub children. The answer consists of the answer ignoring the parent plus the answer including the parent added together. Assume we have the answer for ignoring the parent (since we are planning on building up from leaves) the answer including the parent consists of, the parent on its own (x^(parent weight)) and, the parent plus every subtree including the child (x^(parent weight + subtree weight) summed over every subtree). We already have x^(sub tree weight) summed over every subtree, so this is (1+subtree answer)*x^(parent weight).

So we can now answer a linear tree, but that isn’t very interesting. We need to handle branching. So parent plus two children. (1+subtree1 answer + subtree 2 answer)*x^(parent weight) handles the scenarios of only choosing to include one of the children, but what about some subset of both? Thinking for a bit, this extra term should be (subtree1 answer * subtree 2 answer)*x^(parent weight). Subtree 1 answer is a sum over all the different subtrees on that side, subtree 2 answer is the sum over all those subtrees, so every combination of them can be formed by calculating the product, and each term of the product is correct as it’s x^weight we are trying to calculate, so x^(weight1+weight2)=x^weight1*x^weight2.

This can obviously be expanded to an arbitrary number of children, but as written it’s got 2^(number of children) terms – which is far too many given the tree branching factor can be as high as 49. So we need to simplify. 1 + subtree1 + subtree2 + subtree1*subtree2 = (1+subtree1)*(1+subtree2). And this pattern holds as the number of children increases. So now it’s a product of a linear number of terms – which can be easily calculated.

Finally once we build all the way up the tree, we then need to sum all the values over the tree, as each node is only calculating the sum for subtrees which are rooted at that location.

Of course the answer can be huge, so we need to consider how the modulo comes in to play. Obviously adding two values is easy. Multiplying two values needs care to avoid 32 bit overflow. But the raw values of x^(current node weight) where both x and weight might be large requires both care for 32 bit overflow, and to perform accelerated exponentiation. But those 3 scenarios cover all the operations needed. No division or subtraction to be handled. Pretty easy as 1000pt questions go.
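Putting the pieces together, a minimal sketch of the whole thing might look like this. The input format (a `parent` array with -1 for the root, plus a `weight` array) is an assumption of mine, not the actual problem signature. Python's integers don't overflow, so the care needed for 32 bit multiplication doesn't show up here, and the built-in three-argument `pow` does the accelerated exponentiation.

```python
MOD = 1_000_000_007


def sum_over_subtrees(parent, weight, x):
    """Sum of x^(total weight) over every connected subtree, modulo MOD.

    parent[i] is the parent of node i, or -1 for the root (assumed format).
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    root = 0
    for node, p in enumerate(parent):
        if p < 0:
            root = node
        else:
            children[p].append(node)

    # Breadth-first order puts every parent before its children,
    # so walking it in reverse processes the leaves first.
    order = [root]
    for node in order:
        order.extend(children[node])

    rooted = [0] * n  # rooted[v]: answer over subtrees whose top node is v
    for node in reversed(order):
        value = pow(x, weight[node], MOD)
        for child in children[node]:
            value = value * (1 + rooted[child]) % MOD
        rooted[node] = value

    # Each node only covers subtrees rooted at itself, so sum over all nodes.
    return sum(rooted) % MOD
```

A two node chain with weights 1 and 2 and x = 2 has subtrees of weight 1, 2 and 3, so the answer is 2 + 4 + 8 = 14.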

TCO17 R1A

I got to see 2 am twice staying up for this round, since it was scheduled for the hour when DST ended…

Positive scores advanced, only about 1000 people registered. Only 17 people got all 3 questions out. I think I was close on solving all 3, just ran out of time on the fiddly implementation for Q3, but given how many submissions for Q3 failed system tests, maybe I wasn’t as close as I thought…

Q1) For a table tennis tournament with people with distinct skill levels where greater skill always wins and only one table at which everyone queues, and after n consecutive wins the winner joins the end of the queue after the loser. Determine who wins/loses in the Kth round.

A1) K was only at most 1000, so this was a simple simulation, even if you don’t have a queue data structure handy it’ll trivially run in time. Just need some care tracking how many wins the players have had and be sure to add the retiring winner to the queue after that round’s loser. I guess maybe input sizes of 2 and 3 would also present a potential corner case to some implementations.
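A minimal sketch of that simulation, under my reading of the rules: the first two in the queue play, the loser goes to the back, and a winner who reaches n consecutive wins goes to the back behind that loser. The function name and the (winner, loser) return shape are assumptions for illustration, not the actual problem signature.

```python
from collections import deque


def kth_round(skills, n, k):
    """Return (winner, loser) skill levels for round k (1-based)."""
    queue = deque(skills)
    champion = queue.popleft()   # player currently holding the table
    streak = 0                   # consecutive wins of the table holder
    for round_no in range(1, k + 1):
        challenger = queue.popleft()
        if champion > challenger:
            winner, loser = champion, challenger
            streak += 1
        else:
            winner, loser = challenger, champion
            streak = 1
        if round_no == k:
            return winner, loser
        queue.append(loser)
        if streak == n:
            # Retiring winner queues up behind this round's loser.
            queue.append(winner)
            champion = queue.popleft()
            streak = 0
        else:
            champion = winner
```

With only 2 players the queue is briefly empty in the middle of a round, which is the sort of corner case mentioned above; appending the loser before taking the next player keeps it safe.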

Q2) Determine the maximum sum of products of the longer 2 dimensions of rectangular prisms which are made from an original rectangular prism of integer dimensions A,B,C by slicing parallel to a face to create 2 rectangular prisms of integer dimensions, where the pieces have to have a minimum dimension of each S in order to count.

A2) A,B,C could all be up to 100, and S could be 1, so the number of ways to break it down rules out brute force. Instinct suggested a greedy solution was the way forward. Options to consider are slicing in half, or slicing off a slice of size S. Slicing in half gives you two smaller problems, but you can clearly see that you get fewer slices than slicing off S at a time if you repeatedly try to slice in half. Once you decide to slice off S at a time, you could switch to dynamic programming – state space is at worst size 100^3, with 3 options to check from each state, that will easily run in time.

However I still liked this problem for greedy, so with a little bit of math, you can show that slicing from the smallest dimension is strictly better than slicing from a larger dimension. But if the smallest dimension is less than 2*S, slicing on that dimension has no gain, and slicing in the next longest dimension is actually a win. Again once the two smallest dimensions are less than 2*S, you slice the longest dimension instead. So first loop summing the product of the longest two dimensions until you get the shortest to less than 2*S, then repeat until the second shortest is also less than 2*S, then again for the final dimension, and then finally add the product of the two longest remaining dimensions.

A simple mistake that could be made here is that when you are slicing the second or third longest dimensions to get them down to less than 2*S, you may end up with your 3 dimensions changing sort orders. If you don’t re-sort your array after each run of slicing, you might end up summing the product of the shortest and longest sides, rather than the two longest sides. I sorted every time I changed the value, just to be safe, sorting 3 values is cheap 😛

Q3) Given a strictly convex polygon with the maximum and minimum y valued coordinates on the y axis, determine the volume of revolution around the y axis.

A3) This question was quite unclear, since given the polygon in most cases crosses the y axis, a full 360 degree rotation will cause the shape to sweep over itself. So I did for a moment wonder if they meant the volume of 180 degree rotation – but the first example created a full cylinder out of a square with one edge on the y axis. So I presume that the answer is the union volume from the 360 degree sweep. I think I might have seen an answer which calculated volume of each half, doubled and added together and subtracted the intersection volume – so I think I was right about that.

So the final shape will be a stack of truncated cones, and the formula for the volume of a truncated cone is pretty simple. The trick becomes determining the coordinates where each truncated cone’s outside edge starts and stops. Since it’s a union shape we’re trying to determine the volume of, take each segment on the left of the y-axis and reflect it to the right, and then we need to determine the outside edge of the combined set of segments. Walking the segments in pairs, advancing the segment which ends first, there are 3 possibilities for the overlap y range of the two segments. Either one or the other is completely to the left or right of the other, or they could cross. Turning each segment into a line equation allows the substitution of the y range min and max to check the maximum x-coordinates for the top and bottom of the range. If they belong to different segments, do a segment intersection calculation to find the cross point. Then you get 2 truncated cones instead of one by adding the cross point in between the 2 calculated maximum x coordinates for the start and end of the range.

Corner case – horizontal segments. There can be up to 2 horizontal segments which are at the top and bottom since it’s convex. These two segments can be ignored, and should be since otherwise you can have both horizontal and vertical segments which makes line equations difficult (without them you can construct the line equations as x = ay+b, otherwise vertical lines are annoying), and you also have segment overlaps with 0 extent in y axis which is confusing.
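For reference, the truncated cone volume mentioned above is V = pi * h * (r1^2 + r1*r2 + r2^2) / 3, where r1 and r2 are the x-coordinates of the outside edge at the two ends of the y range and h is the height of that range. A tiny helper (the name `frustum_volume` is mine) covers both the cylinder and full cone as special cases:

```python
import math


def frustum_volume(r1, r2, height):
    """Volume of a truncated cone with end radii r1 and r2."""
    return math.pi * height * (r1 * r1 + r1 * r2 + r2 * r2) / 3.0
```

With r1 equal to r2 this collapses to the cylinder formula, which matches the square-on-the-axis example from the problem.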

Distributed Code Jam 2016 R2

15 to the world finals and at the moment the top 12 solved all the problems. To get through required a good time and only failing one of the larges, and not the one worth the most points.

I think I could have pretty easily gotten a top 200 placing, with the 2 easiest larges and the other two smalls. None of the problems seem ‘too’ difficult, but writing the correct solution to 3 larges in the time available is probably a stretch for me.

Q1) Given some inefficient code, write equivalent but efficient code to solve the problem.

A1) Like last time the detail is of course in the code given. This time it was a bit more deceptive. It appeared to be calculating as a sum the all pair product between a set A and a set B, but subtly the central server doesn’t do any calculating, so it’s actually the all pair product where the pairs don’t sum to 0 mod 20.

The trick here is to realise the all pair product is the sum of all elements in set A times the sum of all elements in set B. Removing the pairs which sum to 0 mod 20 is a bit more tricky, but it’s the product of the sum of the 0 mod 20 items in set A and set B, plus the product of the sum of the items which are 1 mod 20 in set A and the items which are 19 mod 20 in set B, and so on.

To simplify the code slightly, rather than calculating the total and then removing, you can just calculate the sum of items that are x mod 20 in set A and y mod 20 in set B, and sum the all pair products where x+y mod 20 != 0.

All that remains is to solve the problem in a distributed scenario where there are 100 million elements in each of set A and set B. Summing things is easily distributed, so each node creates two sum tables, one for each of set A and set B, with an entry for each mod 20. Then it processes its chunk of the two inputs, calculating the mod and summing it into the correct entry. Once done, send both tables to the central server, which sums all the tables together and then performs the final pair product step.

One thing is that the answer can easily be too big for 64 bit, so the problem actually wants it modulo 1 billion and 7. As with many problems with this requirement, it’s important to remember to apply the modulo after every calculation. And even though the modulo and all individual inputs are valid for 32 bit, calculating products will have intermediary values which need 64 bit precision.
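Here is a single-process sketch of that scheme, with each 'node' just being a function call on a slice of the input. I'm assuming the residue is taken from the element's position; if the given code used the value instead, only the bucket key changes. All the names are mine for illustration.

```python
MOD = 1_000_000_007
BUCKETS = 20


def bucket_sums(values, start_index=0):
    """Per-node work: sum a chunk into one table entry per residue mod 20."""
    table = [0] * BUCKETS
    for offset, value in enumerate(values):
        key = (start_index + offset) % BUCKETS  # assumption: keyed by position
        table[key] = (table[key] + value) % MOD
    return table


def merge_tables(tables):
    """Central server: add the per-node tables together."""
    merged = [0] * BUCKETS
    for table in tables:
        for r in range(BUCKETS):
            merged[r] = (merged[r] + table[r]) % MOD
    return merged


def final_answer(table_a, table_b):
    """Central server: pair products, skipping residue pairs summing to 0 mod 20."""
    total = 0
    for x in range(BUCKETS):
        for y in range(BUCKETS):
            if (x + y) % BUCKETS != 0:
                total = (total + table_a[x] * table_b[y]) % MOD
    return total
```

The final step is only 400 multiplications regardless of the input size, so all the real work is in the distributed summing.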

Q2) Determine the longest prefix of a sequence of brackets which can be extended in to a balanced sequence. Return -1 if the entire sequence is already a balanced sequence.

A2) The question is worded a bit more vaguely than above, but if you’ve had some exposure to bracket systems you’ll know that the description is just equivalent to any balanced sequence of brackets. At first it might seem this problem will be difficult to do distributed because each segment’s validity clearly depends on the one before, but one of the details of bracket sequences is that any prefix can be extended into a balanced sequence, so long as the running sum of open (+1) and closed (-1) never goes negative. Therefore each segment just depends on the running sum of the previous segments.

So each node calculates its total for the local running sum, sends them to the central server which accumulates those running sums and sends back the correct starting total for each node, to that node. Now each node can check whether the true running sum goes negative at any point. If it does, it sends that position to central server, otherwise it sends -1. The central server then returns the first non-negative result. Otherwise it checks if the grand total is 0, if so it returns -1, otherwise it returns the length of the full data set. (This last conditional I think is probably the easiest thing to miss in this problem…)
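The two phases can be sketched in a single process, with the chunks standing in for nodes. The function names and the default of 4 nodes are made up for the example; a real solution would replace the list handling with message passing.

```python
def chunk_total(chunk):
    """Phase 1 on each node: net change of the running sum over its chunk."""
    return sum(1 if ch == "(" else -1 for ch in chunk)


def first_negative(chunk, start_sum):
    """Phase 2 on each node: local index where the true running sum dips below 0."""
    running = start_sum
    for i, ch in enumerate(chunk):
        running += 1 if ch == "(" else -1
        if running < 0:
            return i
    return -1


def longest_extendable_prefix(brackets, nodes=4):
    size = max(1, -(-len(brackets) // nodes))  # ceiling division
    chunks = [brackets[i:i + size] for i in range(0, len(brackets), size)]
    totals = [chunk_total(c) for c in chunks]
    start = 0  # starting total the central server hands to each node
    for index, chunk in enumerate(chunks):
        local = first_negative(chunk, start)
        if local >= 0:
            # Everything before the offending bracket is the longest prefix.
            return index * size + local
        start += totals[index]
    # Never negative: balanced if the grand total is 0, else the whole input.
    return -1 if start == 0 else len(brackets)
```

Note the last line is the conditional that is easy to miss: an input like "((" never goes negative but isn't balanced, so the answer is its full length rather than -1.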

Since the data is only 1 character per value, memory use isn’t a high risk, and the cost per read of value is low enough that you could probably read each value assigned to the node twice without running out of time if you don’t want to read in to a local array. The usual gotchas around splitting data amongst the nodes apply.

Q3) Given a game where you can move left or right 1 unit or stay still in any given turn, and the rest of the game map moves down one unit after your move. Determine the maximum number of points you can get given any choice of starting locations. Each non-wall containing cell is worth between 0 and 9 points which you earn just by entering it.

A3) The small input can be easily done on a single node, and the solution is a dynamic program on the maximum number of points you can get in total once you arrive at position x. From each row from bottom to top you try each non-wall position, then there are 3 possibilities, if the position in front isn’t a wall max its current value with the sum of the local current value and the value of that cell. Then if the right and right forward diagonal aren’t walls, similar with the right diagonal cell, but also add in the value of the right cell. Similarly with left diagonal. This is seeded by the value of each non-wall position in the bottom row, and the final result is the maximum value in the row above the top of the input. This DP is O(1) per cell, so can be completed very quickly in the small input scenario where the maximum size is 1000 by 1000. The large input of 30000 by 30000 would be almost processed in time by a single node, if it wasn’t for the memory limits and the time to read the raw input.

So the question becomes, how to break the problem up to pass the large input across multiple servers. Horizontal stripes are a bad idea, because the mathematical operation is a max of sums and it mixes values from a lot of locations so it doesn’t distribute well trying to chain it with results that arrive after you’ve started. So the solution is vertical stripes. However the edges of each row of a stripe are dependent on one cell to the right and left of the stripe of the row below. If each server was to communicate its edges with every row, that would be 30 thousand messages, which given the estimated 5ms latency per message is far far far too long to run in time.

In order to send fewer messages each node needs to do more work. Because the game is limited to one move left or right per row, if we want to only send a message every 300 rows (which brings in a total of half a second expected latency instead of 150 seconds) we need to start with an extra 600 width on our base section, then 598, 596… This doubles the number of values which has to be read by each server if they would usually be dealing with a 300 wide stripe (and doubles the number of DP calculations, but the reading of the input dominates).

In order to continue after we’ve done 300 rows, we need to share the correct values for the full set of 300 neighbours on each side – which we can get from each of the neighbouring servers if we’ve set it up right… Once we’ve got that information we can continue for another 300 rows.

So, does this solution scale well enough? Estimated latency waiting for network is 500ms. Time to read the values for the stripe and half of each neighbour is 30000*600*0.05us = 0.9 seconds. Actual computation time is much lower. So 5s should be plenty of time. Memory usage is 18MB minimum, but quadruple that if you use 16bit chars and a resizing array without calling reserve. Still fits. In fact it fits well enough you might as well make your life simpler and just use a full 3 widths allocation even though you’ll only use half of each of the outside areas.

Each message sent is 1200 bytes, you send up to 100 messages to 2 servers, this isn’t even a whole megabyte… And the 100 messages is well below the 1000 limit.

Gotchas:

Using your normal logic for distributing the columns to servers can result in server with much less than 300 width. If your code mistakenly decides to send in height chunks equal to the width allocated to the server, you can easily become too message spammy… If you don’t, you need to start requesting results from more than one server over, and again you become too spammy. You want to explicitly allocate exactly 300 main columns to each server to make this simpler. If you don’t use all the servers, that’s okay, it will still easily run in time. More generically (if you don’t like writing your code to the specific problem constraints…) you should divide it in to equal chunks of width depending on the longest of the two dimensions.
One server has a different width, so its neighbouring server gets complicated… Solution is to pretend that the total width is actually exactly a multiple of 300, and fill the extra columns with walls. Makes the code for reading values in to the array slightly more difficult, but makes everything else much simpler…

Q4) Determine the cheapest way to get from a to b if there is a petrol station every km and you use one litre per km, and have a maximum fuel tank size T. Each petrol station has its own price per litre.

A4) So I struggled for a while just to get the logic right for the small input, but eventually I realised it’s actually quite simple. For each km, the cost you paid for that km is the cheapest of the previous T petrol stations. So if you keep a running minimum you just add the minimum value for each km as you go.
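For the small input that can be sketched on a single machine as below. I'm using a monotonic deque for the running minimum here, which is a swapped-in technique of my choosing rather than anything the problem requires, and `prices[i]` is assumed to be the price at the station at the start of km i.

```python
from collections import deque


def cheapest_trip(prices, tank):
    """Total cost when each km is paid at the cheapest of the last `tank` stations."""
    window = deque()  # station indices, prices strictly increasing front to back
    total = 0
    for km, price in enumerate(prices):
        # Stations that are no cheaper than the new one can never win again.
        while window and prices[window[-1]] >= price:
            window.pop()
        window.append(km)
        # Drop stations too far back for their fuel to still be in the tank.
        while window[0] <= km - tank:
            window.popleft()
        total += prices[window[0]]
    return total
```

With prices 5, 1, 3, 4 and a tank of 2, the per km costs are 5, 1, 1, 3 for a total of 10.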

As I mentioned for the first problem, the sum operation distributes well. A running min on the other hand…

So, the solution is all about calculating a running min. If T is small, this is easy, just do it locally. More specifically, if T is less than the width you allocate to each node, the problem might as well be solved locally, each node potentially doubles the number of values it has to read in, but that is still easily managed against the time and memory limits with some care. Each node calculates its sum and sends it to the central server which creates the grand total.

However there is no such limit on T, it could be very large… However, if T is larger than the width you allocate to each machine, you don’t have to do a running min on that machine, it’s just a simple min of the values seen so far. Any value added is never removed. However you also need to consider the minimum of some of the values allocated to other nodes, and that you do have to remove values from. But before we get into that detail… Consider if T is larger than double the width. In that case there is an entire server’s worth of values which will always be in the running min for any given server other than the first. Here we find the opportunity to calculate something distributed which will help the other nodes.

So the first step is for each node to find its minimum price of its allocated section, and broadcast that to every node to the right of it.

So, now for each node the running min consists of 3 (or 4) values of which we take the minimum.

The minimum seen so far in this chunk that we are calculating the sum for.
The minimum of the previous servers which are always completely covered by the fuel tank size T.
The minimum of the previous server which is currently completely covered (but won’t be later once we’ve advanced more).
The minimum of the partial server section needed to fill out T.

The first two of these are easy, the third is easy if it’s present. The fourth presents a bit of a challenge. I propose that the simplest way is for the local server to read the values allocated to that partial server, in reverse order from the end of that server’s allocation. As it finds a new local minimum, it records it and its position on a stack. Then as you advance the pointer you check if the current far minimum is still valid, if not, pop it off the stack and use the one underneath. Worst case this stack contains the entire contents of a server’s allocation, at 8 bytes per entry: 4 bytes for the minimum (since they all fit under 1 billion) and 4 bytes for the index.
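The stack for the partial server section can be sketched like this. The function names are mine, and `values` stands in for whatever the local server reads from the neighbouring allocation.

```python
def build_suffix_min_stack(values):
    """Scan from the end, pushing (minimum, index) whenever a new strict minimum shows up."""
    stack = []
    for index in range(len(values) - 1, -1, -1):
        if not stack or values[index] < stack[-1][0]:
            stack.append((values[index], index))
    return stack


def suffix_min(stack, start):
    """Minimum of values[start:], valid while start only ever moves forward."""
    while stack[-1][1] < start:
        stack.pop()  # this minimum has fallen out of reach of the tank
    return stack[-1][0]
```

The entry for the last position is always pushed first and never popped, so the stack can't run empty while `start` stays inside the allocation. Each entry is popped at most once, so advancing across the whole allocation is linear overall.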

So, does it scale well enough? Well the worst case tank size is when it causes some servers to create the stack for almost one entire server, then again for another server because advancing the width just drops over into the next server, and in order to calculate that minimum you have to process the entire server’s width. The processing time itself is cheap, the memory allocations will fit so long as you reuse your reserved size stack when switching servers at the boundary… (Should be under 80MB if you do it right.) Network latency is not significant… The big concern is the data reading time. 5 million positions allocated per server, reading time 0.75 seconds just for its own. Triple that and it’s getting close to half of the allowed time limit just reading the data.

So, it might be good enough… but can we do better? The answer is kind of yes. At the beginning we calculated a single min per server and sent that around. If instead we calculate 100 mins per server, we don’t make calculating the global minimum for the trivially covered section much slower, but it reduces the size of the worst case stack by a factor of 100. On the other hand it increases the worst case number of stacks we have to create from 2 to 101, but because those stacks are all 1/100th the size, we just reduced the overhead from 2 extra servers’ worth of reading to 1.01 extra servers’ worth. This saves you almost 0.75 seconds of reading time in the worst case. The reduction to the worst case stack size also makes it much much easier to stay under the memory limit. It increases the network traffic size by a factor of 100, but it’s still under 100KB sent per node.

Gotchas:

I said it was easy for the case where T is small – but for T around the size of 1 width it’s actually a bit tricky. For a running min the natural data structure is a multiset (an ordered set which allows repetition), but a multiset with 5 million entries can have an unexpectedly large memory cost, well beyond the 40MB you might expect. Without adequate knowledge of the implementation details of the multiset included in your programming language, you definitely run the risk of running out of RAM… This can be solved by the same extension proposed to reduce the worst case read size in the very large T case. Using the locally calculated 100 mins, the size of T where you need to do a true running min is reduced to only 50 thousand elements, which a multiset will easily fit into memory for.

GCJ 2016 R3

Unlike Round 2 where I think I would have struggled to make the top 500, this round I think I might have done much better if I had been competing. Possibly even top 120.

Advancing to the world finals was definitely beyond me, that would have required solving the large of the problem worth the most points as well as the other problems I consider within my reach, which I struggle to even comprehend how the solution verifier could work…

Q1) Given a sequence of moods which you can either make a request or submission against, and the constraint that you can only submit your last unsubmitted request, determine the maximum score you can get if you get 10 points for requesting in the same mood as a submission, and 5 points otherwise.

A1) The actual problem describes the possibility to get 0 points, but that would require you to request the wrong thing for the current mood and then submit it later also against the wrong mood. Any such scenario can trivially be improved by changing your request to not be the wrong thing, since the later submission would then also not be the wrong match, so you get 10 more points.

So, this problem is a bit deceptive, but a little playing around with scenarios should give you a good guess that greedy is the way forward. If the input sequence of moods has two equal moods in a row, you’ll do a request then a submit to get the 10 points. Having removed those, you might now have moods which are neighbouring and equal, so you can pair those up and get 10 points there as well. Keep repeating until there are no pairings left. What remains is a simply alternating sequence, and there is no way to get 10 points in a simply alternating sequence regardless of the moves you perform. So just take each pair as it comes and get 5 points each.

Having discovered this greedy solution, the only problem remaining is that the large is up to 20k, so an O(N^2) solution is going to be too slow…

As it turns out, it’s possible to do in linear time. The simplest approach I have for doing this is a bit unusual. It uses a linked list!

Convert the input into a linked list, then while you have a current node and a next node, if they are equal, remove those nodes and leave current pointing to either the node just before current, or the node just after next if there is nothing before current. If they aren’t equal, advance one. Every time you remove a pair add 5 points to your total. Then add 5 points for half the size of the original input.
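A minimal Python sketch of that pairing pass follows. It swaps the linked list for a plain list used as a stack, where the top of the stack plays the role of the node just before current; the function name and the string input format are assumptions made for illustration.

```python
def max_score(moods):
    """Score a mood string by greedily cancelling equal neighbours."""
    stack = []          # moods still waiting for a partner
    matched_pairs = 0   # pairs that earn the full 10 points
    for mood in moods:
        if stack and stack[-1] == mood:
            # equal neighbours: remove the pair, which exposes the node before it
            stack.pop()
            matched_pairs += 1
        else:
            stack.append(mood)
    # every pair earns at least 5 points; each matched pair earns 5 more
    return 5 * (len(moods) // 2) + 5 * matched_pairs
```

Each character is pushed and popped at most once, so the pass is linear in the input length.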

A more complicated approach which doesn’t technically require a linked list, would instead involve an array of starts. When you find a pair you repeat outwards to find the largest simple chunk which can be matched up from the inside out. Then you set the start array for the last member of that chunk to point to the first. Then you move on. Whenever you finish creating a chunk, check if its left neighbour has a start value. If so use that value to conceptually O(1) merge this chunk with its neighbour and start considering whether the ones outside that can be made in to a pair. If so, keep going and write a start value once you get stuck again. This basically simulates the process of the linked list algorithm…

Q2) Determine the fraction of strings which contain certain words where the strings are generated by all the possible ways to select nodes of a forest such that you only ever select a child after its parent and each node has a single character label.

A2) This question was probably the one I was least likely to get out even though I could solve it. It just looks too hard…

In practice you just need a small number of scenarios to convince yourself that a simple answer is in fact correct.

Because the required accuracy is only 3 parts in 100, randomly generating 10000 of the strings should be plenty. (The exact detail escapes me, but I seem to recall that if you require an accuracy of 1 part in x for something that has a specific random chance, you need to run x^2 simulations.)

Given the input size is only up to 100, O(N^2) should be ‘okay’ to generate 10k strings if you are quick about it. The trick is just how to ensure that your randomly selected strings are representative.

To work this out, consider a simple case where one node is on its own, and 10 others are in a chain. There are 11 possible outputs, one for each of the different possible locations of inserting the one node in the chain. If you were to select with equal likelihood from the available options at any point, half of your generated strings will start with the label of the one node on its own, when it should only be ~9%. The next option I considered was randomly selecting a node and adding it and all of its parents. However if we do that the probability of the one node on its own being the last value is far higher than 9%.

The third option I considered was weighting each available node to select from by additionally including the count of all of its descendants. This means the first selection is 10 parts the chain, and 1 part the node on its own. Which gets us the correct percentage. Following all the way through the options shows it generates things correctly for this scenario. Which is promising. We already know it also generates the right values for a forest entirely of single nodes. So that is two data points in its favour… I tried one more scenario to convince myself. A single node and a simple 3 node tree. The tree has 2 orders, and the single node has 4 insertion points, this gives 8 possible scenarios. Of those 8 scenarios 2 start with the single node and 6 start with the root of the tree. Again the weighted selection works: 3:1 corresponds to 6:2. And walking through the rest of the scenarios shows the percentages work out.

The good thing about this question is that it’s just a high valued small input, so if this weighting thing is wrong, we can find out pretty quickly… but it’s just fine as it happens.

So calculate the weights of the tree, then add the roots, randomly select based on total weight, remove that node, add its children, repeat. This is O(N^2).

A slightly nicer to implement solution is to recognise that the weights are effectively a place holder for allowing random selection of any node and then putting the deepest not yet placed parent of that node instead of the selected node. This way you don’t have to do any tree constructing. Just have a boolean array representing what has been selected so far, generate a random number the size of the remnants and walk to find the nth not selected item. Then walk its parents while they exist and aren’t selected. Finally select that node and then repeat the process until all nodes are selected.
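The boolean-array variant can be sketched in Python roughly as below. The parent-array representation (with -1 marking a root) and the function name are assumptions for illustration, and the word matching on the generated string is left out.

```python
import random

def random_order(parent, rng):
    """Return one random valid selection order for a forest.

    parent[i] is the parent of node i, or -1 if node i is a root.
    """
    n = len(parent)
    placed = [False] * n
    order = []
    for remaining in range(n, 0, -1):
        # pick the k-th node that has not been placed yet
        k = rng.randrange(remaining)
        node = -1
        for i in range(n):
            if not placed[i]:
                if k == 0:
                    node = i
                    break
                k -= 1
        # climb to the topmost ancestor that has not been placed yet
        while parent[node] != -1 and not placed[parent[node]]:
            node = parent[node]
        placed[node] = True
        order.append(node)
    return order
```

Because every unplaced node maps to exactly one currently available ancestor, an available node is chosen with probability proportional to the size of its remaining subtree, which is the weighting described above. Mapping each node index to its character label then gives one sample string.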

Q3) Determine the smallest maximum jump size to get from one asteroid to a second asteroid, if there are a bunch of asteroids moving at linear speeds and you can jump between any two at any moment so long as you don’t stay anywhere more than S seconds.

A3) This problem reminds me of another I’ve done before, but that was in 1 dimension, this is in three…

However, the small input is easy, since all the linear speeds are zero. Therefore there is no point to waiting anywhere. Therefore it just becomes a question of for a given maximum jump size x, is there a path between asteroid 0 and asteroid 1 then solving for the smallest maximum jump size x. Once a maximum jump size is set, the problem reduces to a simple search (DFS is one option). To solve for smallest, you can do a binary search.
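For the stationary case, a rough Python sketch of the binary search plus DFS might look like this. It assumes integer coordinates (so squared distances compare exactly), that asteroid 0 is the start and asteroid 1 the destination, and it searches over the sorted pairwise distances rather than over a real-valued range; the function name is made up.

```python
import math

def smallest_max_jump(points):
    """points: list of (x, y, z) tuples; start is index 0, goal is index 1."""
    n = len(points)

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(points[a], points[b]))

    # the answer must be one of the pairwise distances
    candidates = sorted({dist2(a, b) for a in range(n) for b in range(a + 1, n)})

    def reachable(limit2):
        # iterative DFS using only jumps of squared length <= limit2
        seen = {0}
        stack = [0]
        while stack:
            cur = stack.pop()
            if cur == 1:
                return True
            for nxt in range(n):
                if nxt not in seen and dist2(cur, nxt) <= limit2:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if reachable(candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return math.sqrt(candidates[lo])
```

The largest candidate always permits a direct jump from 0 to 1, so the search is guaranteed to land on a feasible value.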

Contest analysis isn’t up yet, and I’ve not read anyone else’s solutions, so I don’t know how to do the large. I suspect it has a similar structure. Once you have a set maximum jump size you can determine the time ranges that a given edge is open or closed. This appears to cause a combinatoric explosion if you clone each node based on the time range that a given combination of edges are open/closed, with edges between the clones depending on the size of S. But maybe you don’t need to: you can instead just have a separate clone per status of a single edge, connected to clones representing different arrival time ranges, where the separate arrival and departure clones are connected depending on the times and size of S… I don’t quite think it works though.

Q4) Given a simple single bit single register computer and two programs that run simultaneously made of three atomic instructions, set 0, set 1 and print register, determine a program pair which can potentially print any of the ‘good’ values, but never prints the ‘bad’ value.

A4) Again I don’t know how to solve the large (at least I don’t think I do…), but the small is surprisingly trivial.

For the small input the bad value is always a sequence of ones. So as long as the bad value isn’t also a good value (which is something to just check for in general…) one option is to write a program which can print any sequence of X digits except for all ones. One way to do this, is to have one program that continuously sets the register to 0, then prints the register, x times. Then the second program consists of exactly X-1 calls to set the register to 1. Because of arbitrary interleaving, this program pair can print anything except for X 1’s. Since the first program always sets to 0 before printing, there would need to be an interleave before every print, but there are only X-1 instructions from the second program to interleave.
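The interleaving argument is easy to check by brute force for small X. The Python sketch below enumerates every interleaving of the two programs and collects what gets printed; the instruction names and the initial register value of 0 are assumptions (the initial value never matters, since the first program sets the register before every print).

```python
from itertools import combinations

def printable_outputs(x):
    """All strings the program pair can print, over every interleaving."""
    prog_a = ["set0", "print"] * x   # first program: set 0 then print, x times
    n_b = x - 1                      # second program: x - 1 copies of "set 1"
    total = len(prog_a) + n_b
    outputs = set()
    # choose which global time steps belong to the second program
    for b_slots in combinations(range(total), n_b):
        b_slots = set(b_slots)
        register = 0
        printed = []
        a_iter = iter(prog_a)
        for step in range(total):
            if step in b_slots:
                register = 1
            else:
                op = next(a_iter)
                if op == "set0":
                    register = 0
                else:
                    printed.append(str(register))
        outputs.add("".join(printed))
    return outputs
```

For x = 3 this yields all seven 3-digit strings other than 111.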

This is suggestive of possible approaches for the large, but the open question in my mind is whether they cover all possible inputs correctly, or just a subset. For instance an input where the bad value is all zeros can be solved using an inverted version of the program. If the bad value has exactly one one digit, and the good values have either all more or all fewer than one one digit, similar constructions work. In general I can solve the case where the number of set bits in every good value is either strictly greater or strictly less than in the bad value, but not otherwise. For specific cases, if there is only one good value, I can solve that too… If there are multiple good values and they have x set bits in common, but the bad value is ‘further away’ from the commonality than the good values, or doesn’t have all of those bits set, that can be solved too. Or inverted for x unset bits in common.

In general the things I can solve are where one program does all the printing, and the second program is a sequence of ones or zeros of some length. It’s not clear to me that such a program pair is always the solution if there is such a solution, but it doesn’t seem unlikely…

Distributed Code Jam 2016 Round 1

So this year, the structure is a bit different 2 rounds before the grand final for distributed and more people eligible to compete in DCJ.

As usual I’ve been busy, so Round 2 is almost about to happen, I figured it was time to write up my analysis of Round 1.

Unlike round 2 of the main code jam, I think I would have had a chance of advancing to round 2 of DCJ this year if I was competing. I usually say there is practically no chance of me writing up 4 solutions in 3 hours, but this round the questions seemed pretty easy (except for the large of the last one), so I think there would have been a real chance of me getting a top 200 placing.

Q1) Given some inefficient source code, write an equivalent, but efficient version.

A1) So the detail is of course in the provided source code. It is a bit convoluted, but the net result is that it’s calculating the largest difference between any two values in the input.

Of course the best way to do that is to calculate the global maximum and global minimum and take their difference. For the large input there is too much data to do this all in one node (of course), but if you break the input into approximately equal chunks per server, each server calculates its min/max and sends them to a central server, which calculates the min of mins and max of maxes. Then the difference of those two is the largest difference and that central server can print it out.

I suspect the biggest chance of failing this question is in breaking up the input incorrectly. It’s a very common task in DCJ, so you always want to keep the code around to do this. You need to handle cases where there is less input than there are nodes, or where trying to allocate an equal size to each node means that some nodes won’t get any. If you want to get tricky and try dividing up the input by modulo rather than blocks (which can seem easier), remember to write your loop using += modulus rather than checking every number against the mod. 1 billion divisions is going to time out…
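One way to split the input that copes with both corner cases is sketched below in Python. It simulates the nodes in a single process, so `chunk_bounds` and `largest_difference` are hypothetical helper names and none of the real DCJ message-passing calls appear.

```python
def chunk_bounds(n, nodes, node_id):
    """Half-open range [lo, hi) of input positions owned by node_id.

    Chunks differ in size by at most one and may be empty when n < nodes.
    """
    lo = n * node_id // nodes
    hi = n * (node_id + 1) // nodes
    return lo, hi

def largest_difference(values, nodes):
    """Simulate each node streaming its chunk, then a central reduce."""
    partials = []
    for node_id in range(nodes):
        lo, hi = chunk_bounds(len(values), nodes, node_id)
        if lo == hi:
            continue  # this node has no input, so it sends nothing
        chunk_min = chunk_max = values[lo]
        for i in range(lo + 1, hi):
            # running min/max over the stream, nothing is stored
            chunk_min = min(chunk_min, values[i])
            chunk_max = max(chunk_max, values[i])
        partials.append((chunk_min, chunk_max))
    # the central node combines the per-node results
    return max(mx for _, mx in partials) - min(mn for mn, _ in partials)
```

Computing both bounds from the same formula means neighbouring chunks always meet exactly, with no gaps or overlaps.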

Another risky move would be to read all the values you need into memory rather than just calculating the min/max on the stream of values as you read them in. This should only take 80MB of ram per node in the worst case, but if you are using a doubling resizeable data structure like vector, you will be very likely to hit 128MB instead, maybe even higher… Reserve the required space to avoid this.

Q2) Determine who wins a giant rock paper scissors competition if everyone always plays a specific move, and in case of draws the person with lower id wins.

A2) The large input size is only 2^28 unlike the 1 billion for the first problem, but it’s still easily large enough for you to time out if you tried to read all of the values on a single node. Breaking the input up into approximately equal chunks across all servers is not a good strategy this time. A much better option is to break it up into a number of exactly equal chunks, where that number is the largest power of 2 which is no greater than the number of servers. This means each server is running its own isolated competition with a single winner. Each single winner can then be passed to the central server and it can continue from there.

This one doesn’t seem very risky so long as you break your input up by powers of two. Only 4 million characters, you can read them all in to memory, and even not bother reserving, or reusing vectors as you run the contest (even if you happen to be using 16bit characters…). Run time should be pretty fast too.
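A single-process Python sketch of the power-of-two split is below. The move letters R, P and S, the (id, move) tuples and the function names are assumptions for illustration, and the input length is assumed to be a power of two.

```python
# (winner, loser) pairs
BEATS = {("R", "S"), ("S", "P"), ("P", "R")}

def play(a, b):
    """a and b are (id, move) tuples with a's id lower; a wins draws."""
    return b if (b[1], a[1]) in BEATS else a

def winner(players):
    """Run a knockout bracket over neighbouring players."""
    while len(players) > 1:
        players = [play(players[i], players[i + 1])
                   for i in range(0, len(players), 2)]
    return players[0]

def distributed_winner(moves, nodes):
    n = len(moves)
    # largest power of two that is no greater than the node count (or n)
    chunks = 1
    while chunks * 2 <= min(nodes, n):
        chunks *= 2
    size = n // chunks
    # each node runs its own isolated competition
    local = [winner([(i, moves[i]) for i in range(c * size, (c + 1) * size)])
             for c in range(chunks)]
    # the central node finishes the bracket with the local winners
    return winner(local)
```

Because every chunk is an aligned power-of-two block, the local brackets are exactly the lower rounds of the full bracket, so the result matches a single-node run.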

Q3) Determine the minimum number of moves to restructure a bunch of piles of crates to have the specified almost uniform height across all piles. (All equal, or one higher for some of the earlier stacks if the total isn’t exactly divisible amongst the number of stacks.)

A3) Definite step up in difficulty here. The contest analysis describes one approach, I’m going to describe a slightly different one which I think is a bit simpler logically, although it’s a bit more difficult to code to be fast enough…

First phase we need to know how high each stack should be. To know this we need to know how many crates there are in total. So every node takes a chunk and sums it up and sends to central server which sums the sums, and then rebroadcasts that grand total back to everyone…

The core trick to the problem is that because the crates can only be moved one at a time to a neighbouring stack, the problem can be solved left to right one stack at a time. The first stack knows how high it should be and how high it actually is. If it’s too high by k, it takes k moves to move the excess to the next stack. If it’s too low, it needs to steal from its neighbours to the right, so walk right stealing until you have enough. Steal from the closest first, and as you get further away each steal costs multiple moves to move through all the intervening stacks.
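For a single node, that left-to-right pass could be sketched in Python as follows. This is only an illustration under the assumption that the whole input fits on one machine; the function name is made up, and Python’s unbounded integers hide the overflow and modulus concerns that a fixed-width implementation has to deal with.

```python
def min_moves(heights):
    """Minimum single-crate neighbour moves to level the stacks."""
    n = len(heights)
    base, extra = divmod(sum(heights), n)
    h = list(heights)
    cost = 0
    src = 0  # stacks between the current one and src are already picked clean
    for i in range(n):
        target = base + (1 if i < extra else 0)  # earlier stacks get the extras
        if h[i] > target:
            # too high: push the excess one step right, one move per crate
            excess = h[i] - target
            h[i + 1] += excess
            cost += excess
        else:
            # too low: steal from the closest non-empty stacks to the right
            need = target - h[i]
            src = max(src, i + 1)
            while need > 0:
                take = min(need, h[src])
                cost += take * (src - i)
                h[src] -= take
                need -= take
                if h[src] == 0:
                    src += 1
    return cost
```

The `src` pointer only ever moves right, which is what keeps the pass linear in the number of stacks.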

For the distributed version each node needs to solve in isolation to be fast enough, so at some point you are going to dump a bunch of crates onto the next server, or you are going to need a bunch of crates from the next server. For the purposes of stealing, once you get to the edge you just assume the other server will have moved them into your last stack and you can count the cost of moving them from that stack to the right locations.

So, since each server will need to know the flow at its server edges to calculate the correct value, the central server might as well calculate that while it’s calculating the grand total. All it needs to work that out is the desired height total for each server and each server’s actual sum. Since it just worked out the grand total, the desired height total for each server isn’t much harder, with the one corner case of the server where the desired height changes. Then going from left to right, comparing the sum to the desired total, it can work out the flow to the next server, and by combining that flow with the next sum and comparing to the next desired total it can determine the next flow after that….

It then sends every server the flow across its boundary with the server before and with the server after. The initial and final flows are of course 0.
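The central server’s flow calculation is just a running difference. A small Python sketch, with hypothetical names and the sign convention (an assumption) that a positive flow means crates moving to the right:

```python
def boundary_flows(chunk_sums, chunk_targets):
    """flows[k] is the net number of crates crossing into server k from
    server k - 1 (negative means they move left); flows[0] and flows[-1]
    are the outer edges and are always 0."""
    flows = [0]
    carried = 0
    for actual, desired in zip(chunk_sums, chunk_targets):
        carried += actual - desired
        flows.append(carried)
    return flows
```

For example, chunk sums of 10, 0 and 2 against desired totals of 4 each give flows of 0, 6, 2 and 0.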

Each server node can then process left to right. First deal with the flow. A positive flow into this node you just add to the corresponding edge stack. For a negative flow you start the standard stealing process to calculate it, remembering to track the cost.

Once the flows are fixed you can just solve the problem locally given the desired height calculated from the grand total.

Gotchas to watch out for.

Stealing from neighbours can cause you to run too slow if you try to check the same set of neighbours over and over. You need to keep track of where you have already picked clean, and start one right of that any time you need to steal to ensure that the local calculation part is linear in the size of the section assigned.
Even though each stack size is only up to 1 billion, the flows calculated can be in the hundreds of millions of billions. Therefore you must use 64 bit entries. Given each node is dealing with up to 10 million entries, this means you must reserve to avoid the risk of hitting 128MB.
Because of the large flows that can travel across nodes, when stealing the cost calculated from a single steal can overflow 64 bits. Either do the product with 128bit types, or (more sensibly) ensure that you mod the number of crates being moved before you multiply by the distance moved.
Mod everywhere – the product I mentioned above is the biggest individual risk, but there are plenty of other ways to overflow if you don’t remember to apply the mod during the process and not just at the end.

Q4) Determine the smallest positive integer in a set which is unique.

A4) The small can be run on a single node, and is easy to write. Just read the values in, sort them, return the first which is not equal to one of its neighbours. Or zero if you get to the end of the array.

The large is only 35 Million, which if it wasn’t for the memory limit is almost able to be processed in the 4 second time limit… Well, maybe it would need 7-8 seconds…

The contest analysis solution is much safer: rather than trying to do a distributed merge sort, break the problem into chunks and then combine the results. Unlike the other problems where the chunks are segments of the input stream (or stripes if you like for the first problem), the appropriate chunks depend on the input. In order for each section of input to be processed independently, we need to be sure that all duplicates of any given number are processed by the same node. This can be achieved by a hash function. Each node processes a segment, and after locally separating the dupes from the uniques, forwards values to other servers based on their hash. Each server then receives two sets of values from other servers, the uniques and the dupes. Since it now knows all copies of a specific value are together, it can make a call as to the lowest unique which has a specific hash assigned to this server. It sends these locally lowest definitely globally unique values to the central server, which can then calculate the global lowest.

With the relatively small input size, if you are using all 100 nodes it would seem that there is no risk of running out of ram following the general strategy described above. But if you don’t reduce the input sent to specific servers by hash down to the two sets and instead send all the values, a single server can easily be overwhelmed by the input case where all the values are the same. If your hash is terrible (ie mod 100), similar problems can be had. But if your hash is decent, and you do correctly filter the data down before sending it on, it should be pretty safe.
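The routing idea can be simulated in one process, as in the Python sketch below. The multiplicative hash is an arbitrary stand-in chosen for illustration, not the one from the contest analysis, and the in-memory inboxes stand in for whatever messages the real nodes would exchange.

```python
from collections import Counter

def node_of(value, nodes):
    # hypothetical hash: multiply by a large odd constant before reducing
    return (value * 2654435761 % 2 ** 32) % nodes

def smallest_unique(values, nodes):
    n = len(values)
    inbox_unique = [[] for _ in range(nodes)]
    inbox_dupe = [set() for _ in range(nodes)]
    # phase 1: each node reduces its segment to two sets, routed by hash
    for k in range(nodes):
        lo, hi = n * k // nodes, n * (k + 1) // nodes
        for value, count in Counter(values[lo:hi]).items():
            dest = node_of(value, nodes)
            if count == 1:
                inbox_unique[dest].append(value)
            else:
                inbox_dupe[dest].add(value)
    # phase 2: each node now sees every copy of the values hashed to it
    candidates = []
    for k in range(nodes):
        seen = Counter(inbox_unique[k])
        good = [v for v, c in seen.items()
                if c == 1 and v not in inbox_dupe[k]]
        if good:
            candidates.append(min(good))
    # phase 3: the central node takes the global minimum, or 0 if none
    return min(candidates) if candidates else 0
```

Sending each duplicated value once, rather than every copy, is what keeps the all-values-equal input from overwhelming a single receiver.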

GCJ16 R2

Busy as usual, so this analysis is quite late. I’m not sure I would have advanced if I was competing in this round, the top 500 cutoff was both the first two questions and one of the other smalls. The first two problems weren’t especially challenging, but writing correct solutions to 3 problems (even if one is a small) in the time available is not really something I am reliable at.

Top 1000 for the t-shirt was much more manageable: the first two problems, or even failing one of those larges, you pass if you have one of the other smalls as a backup.

Q1) Determine the minimal ordering for a set of rock, paper, scissors players such that there will be no draws in the resulting tournament. Or return impossible if such a scenario doesn’t exist.

A1) If you solve the small version by brute force you can see that for any given total number of players, there are only 3 possible scenarios in terms of the number of players who like each of rock, paper or scissors. This makes a lot of sense once you try and build the problem from the end result. The tournament either ends in a winner who plays R or P or S. For each you know that they had to have played the type which they can beat. This continues until you get to the right total number of players. The trick is to determine the minimal ordering. A mistake I made was to assume that if when you are generating the set of players you always choose to generate the next level in alphabetical order where possible, the final result will be optimal. If you don’t have sharp eyes you might miss that this doesn’t generate the same result as your brute force solution. Moral of the story, use an actual diff to compare the output of your brute force and optimal solutions…

An example of how this is wrong is if you start with R – next is RS, third tier is RSPS, but the ideal is PSRS. The solution is to perform a more general optimization step. Rather than sorting just the neighbours, sort the pairs, then the quadruples, then the octuples, etc… You might as well just do this at the final stage, unless you’ve decided to pre-generate all the answers for all the different lengths, which is easily managed for the given constraints. Such a sort maintains the tree structure of the competition, but (fairly obviously) minimizes the ordering.
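A compact Python sketch of building the line-up from the winner backwards is shown below. It sorts at every level of the recursion rather than only at the final stage, and the function names and the IMPOSSIBLE return string are assumptions for illustration.

```python
BEATEN_BY = {"R": "S", "P": "R", "S": "P"}  # each move and the move it beats

def lineup(winner, rounds):
    """Minimal ordering of the 2**rounds players that makes `winner` win."""
    if rounds == 0:
        return winner
    # the winner of this subtree must have met the move it beats in its final
    a = lineup(winner, rounds - 1)
    b = lineup(BEATEN_BY[winner], rounds - 1)
    # swapping the two halves keeps the bracket valid, so take the smaller order
    return min(a + b, b + a)

def solve(rounds, r, p, s):
    for champion in "PRS":
        order = lineup(champion, rounds)
        if (order.count("R"), order.count("P"), order.count("S")) == (r, p, s):
            return order
    return "IMPOSSIBLE"
```

Starting from R with two rounds this produces PSRS, matching the example above.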

Q2) Given a set of probabilities of voting yes or no, determine a subset of size K which maximises the chance of a tie.

A2) I think this question is actually quite tricky, but if you do the brute force for the small input you can pretty easily see a pattern. The extremes of the probability spectrum are the best choices. Having made that observation you might try a few options, like greedily adding either the highest two, or the lowest two, or one high and one low, whichever is best. This doesn’t work. What does work is trying every combination of size K which takes the i minimum values and the K-i maximum values, for each i from 0 to K. I don’t really see how you could come to this solution except by observation. My intuition mistakenly suggests it should be the middle ones.

However that isn’t the whole problem – for a given set of K probabilities, you need to determine the probability of a tie. Even for the small input, brute forcing this is unreasonable. However, this sub-problem is a straightforward dynamic programming problem. Given x people voting yes or no, there are between 0 and x yes votes. So you can have a dynamic program on the probability of the first x people having voted y yes votes. Start with 0 people having 0 votes as probability one. Then use f(x,y) = f(x-1, y)*P(x votes no) + f(x-1, y-1)*P(x votes yes) to fill in the half triangle of the K by K grid of scenarios and return f(K, K/2).
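Putting the two pieces together, a Python sketch might look like the following. It keeps only one row of the f(x, y) table at a time, and the function names are made up for illustration.

```python
def tie_probability(probs):
    """Probability that exactly half of the voters in probs vote yes."""
    k = len(probs)
    dist = [1.0] + [0.0] * k  # dist[y] = P(y yes votes among voters so far)
    for x, p in enumerate(probs, start=1):
        # go downwards so dist[y - 1] still holds the previous row
        for y in range(x, 0, -1):
            dist[y] = dist[y] * (1 - p) + dist[y - 1] * p
        dist[0] *= 1 - p
    return dist[k // 2]

def best_tie(probs, k):
    """Try every split of k picks between the lowest and highest values."""
    ordered = sorted(probs)
    best = 0.0
    for low in range(k + 1):
        chosen = ordered[:low] + ordered[len(ordered) - (k - low):]
        best = max(best, tie_probability(chosen))
    return best
```

With K + 1 splits and a quadratic table for each, the whole search is cubic in K, compared with the exponential number of subsets a brute force would visit.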

Q3) Given a grid which is to contain a hedge maze constructed of diagonal walls, determine if (and if so how) to construct the hedge such that there is a specific set of paths between edges where these paths never join or cross.

A3) The first observation is that the diagonal hedge maze structure can only create paths that are linear, so you don’t have to worry about the joining or crossing part. You just need to concern yourself with the paths getting to the right edge destination for each possible starting edge location.

The small brute force is small enough that you can construct every possible hedge maze and then trace the paths. The main difficulty with this O(2^(R*C) * R * C) solution is tracing the paths. One option (which I investigated) is to store current location as cell x,y and whether it’s the top or left edge of that cell, with virtual cells beyond the edges to do the bottom edges and right edges of the bottom and right most cells. Then given that edge and diagonals in the neighbouring cells, determine the two possible new locations. Rule out the one which is either outside or the same as the previous location you were at. This works, but it’s far more cumbersome than the method suggested in the analysis, which is to store position as cell x,y and which of the 4 directions you are entering this cell from. By increasing the size of the third component from 2 to 4 options, you get rid of the need to possibly generate two locations, or track where your previous position was.

As an aside, these kind of grid problems which work with edges are something I feel I could do something nice about in TMD.Algo – hopefully when I get some time I’ll make such a convenience class.

The large can’t be brute-forced, but some iterative thinking comes to the conclusion that it can be solved greedily. If two neighbouring edge locations need to join, the minimum number of cells their path can go through is those exactly neighbouring them. And such a path can always be constructed. If you want to connect things which are further apart, the optimal solution is to hug the wall of the paths of things closer together which have already been connected. This wall hugging is clearly in some sense minimal. Its compactness means it maximises the region remaining for other paths to be constructed through. Thus the conclusion that this greedy wall following will either work or the problem isn’t solvable.

I feel that writing the code to add walls to create the wall following is something I would have found awkward and difficult under a time constraint, especially with my x,y,(left or top) position notation. However the contest analysis makes the excellent point that the wall following is either flowing through a cell which already has a wall, or you want to turn left if you are going between two locations that have shortest length in the clockwise direction. With the x,y,(direction of entry) formulation, it’s very easy to determine what wall you have to add in order to turn left.

Q4) Given a set of people who have a different skill set, determine the minimum number of additional skills to give to some of these people to ensure that regardless of order of arrival and selection of tasks to work on (from those not already taken), everyone ends up with something to work on. (Number of tasks and their corresponding required skills is equal to the size of the set of people.)

A4) So this question deceived me into thinking it was easier than it actually is. Some reasoning about the problem comes to the conclusion that if two people know the same skill, they are ‘connected’, and if a group of people are all somehow connected, they must be fully connected, or there exists some arrival order which is wrong. From that I jumped to a completely incorrect solution of computing the connected components of people and just summing the minimum number of skills needed to make those connected components fully connected. This doesn’t even work for the samples included… since it doesn’t necessarily ensure that everyone knows anything at all.

The small input size is only 4 people, so there are only 2^16 possible skill set scenarios. The criterion for whether a scenario is valid is that for each column there must be at least one bit set, and for each row as well. If there is more than one bit set in a row, those columns must be identical, and vice versa. Each resulting group must also be square, with as many people as skills. Choose the ones which are a superset of the starting details, and find the one with the minimal set of differences. That minimum is the answer.
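A brute-force sketch of the small case in Python is below. Each person is a bitmask of skills; the names are hypothetical. The validity check is written in terms of groups: anyone sharing a skill must have an identical skill set, and each group of identical people must hold exactly as many skills as it has members, which is the ‘x people knowing x skills’ shape used for the large.

```python
from itertools import product

def is_valid(rows):
    """rows[i] is the bitmask of skills known by person i."""
    for r in rows:
        if r == 0:
            return False  # this person can never work on anything
        group = [q for q in rows if q & r]  # everyone sharing a skill with r
        if any(q != r for q in group):
            return False  # connected people must know identical skills
        if len(group) != bin(r).count("1"):
            return False  # the group must be square: x people, x skills
    return True

def min_extra_skills(rows):
    n = len(rows)
    # only supersets of each person's starting skills are allowed
    options = [[s for s in range(1 << n) if s & r == r] for r in rows]
    start_bits = sum(bin(r).count("1") for r in rows)
    best = None
    for candidate in product(*options):
        if is_valid(candidate):
            cost = sum(bin(c).count("1") for c in candidate) - start_bits
            if best is None or cost < best:
                best = cost
    return best
```

Since the groups are disjoint and square, and every person belongs to one, the skills they cover add up to n, so the check that every column has a bit set comes for free.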

The large input is far from being susceptible to brute force. 2^50 skill set scenarios. However, greedy approaches like my first attempt are equally problematic. It comes down to avoiding the full 2^50 explosion while considering every possibility.

The first key observation, which is given by the contest analysis, is that for a connected component which has p people who know q skills, it doesn’t matter how they are connected or what the specific labels of those people or skills are. The final answer is groups of x people knowing x skills with x^2 edges. The sum of these x^2’s minus the original number of edges needs to be minimised, but that answer isn’t affected by any of those details.

The second key observation is that a connected component that has p people knowing q skills is further interchangeable with any other component with the same number of people and skills.

So a given scenario consists of counts of distinct pairs. The number of new scenarios that can be generated is quadratic in the number of distinct pairs, but each will always have one less total number of pairs. So we consider every possible scenario generated so far with total number of pairs k, use it to generate new scenarios with k-1 pairs, and just keep generating from the distinct results. If you see a scenario with only balanced values, calculate the sum of squares to be a new potential minimum. This approach is a bit different to the one described in the contest analysis, and may not actually run fast enough – I’ll need to check… Writing the appropriate hash functions for a multi-set of pairs would be interesting.

GCJ16 R1C

So, I’ve been a bit busy, so I didn’t get around to writing this up earlier.

Quite a tricky third problem with two problems that were reasonably easy left the cut-off being quite a fast time on the first two, or at least solving the small of the third.

I was amused to read the name of the person who came 1005th. ‘i.will.get.a.t-shirt’ missed out on getting to round 2 by 15 seconds…

Q1) Determine how to reduce a set of counts so that no specific member has a majority at any point, up to 2 can be removed at any point.

A1) This problem has two parts: the concept that avoiding a majority comes by reducing the largest count, and the realization that if there are only 2 non-zero counts they must be equal, and hence you have to remove one from each or a majority is formed. From there it’s straightforward.
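Those two rules translate fairly directly into code. A Python sketch, assuming the counts are indexed so that index 0 is labelled A, index 1 is B, and so on, and that the starting counts have no majority:

```python
def evacuation_plan(counts):
    """Return a list of steps; each step names one or two members to remove."""
    counts = list(counts)
    plan = []
    while sum(counts) > 0:
        nonzero = [i for i, c in enumerate(counts) if c > 0]
        if len(nonzero) == 2:
            # two equal groups left: remove one from each to keep them equal
            step = nonzero
        else:
            # otherwise remove a single member from the largest group
            step = [max(nonzero, key=lambda i: counts[i])]
        for i in step:
            counts[i] -= 1
        plan.append("".join(chr(ord("A") + i) for i in step))
    return plan
```

For counts of 2 and 2 this gives two steps of AB; rescanning the counts every step is wasteful but harmless at contest sizes.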

Q2) Given a set of nodes, create a directed acyclic graph which has exactly M paths between 2 specific nodes.

A2) For me this problem was pretty easy, but I could see it would be easy to get stuck thinking about it when you should be investigating to discover the trick.

My initial thought was recalling that it’s possible to number the nodes in a directed acyclic graph such that every node strictly points to nodes with a larger value. The scenario with maximum number of paths would therefore be the one which maximizes the number of edges while satisfying that all edges are forwards. So node 1 connects to 2 through N, node 2 connects to 3 through N, etc.

The number of paths in such a graph can be calculated easily enough – set 1 to node 1, then for each node in ascending order, sum all the values associated with nodes that connects to it, which is all of them seen so far. This gives the sequence 1, 1, 2, 4, 8, 16, …

At this point the trick is to realize that the final node is connected to all the previous nodes in this fully connected graph, hence why its value is 2^(N-2), if we were to break one of those links, its value would be reduced by the corresponding value of the node it was linked too. Obviously if M = 2^(N-2) we need all of them , but otherwise we’ve got all the smaller powers of two and we just need to connect them based on the bit pattern of M. Or, just start from the largest and if its smaller than M, connect it, and reduce M, otherwise don’t connect it.

The final result is a triangle matrix, excluding the connections to the last node, which are either all of them, or a 1-shifted copy of the bit pattern of M.
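Here is a rough Python sketch of that construction, using 0-indexed nodes (so node 0 is the start and node n-1 is the end). The function names are hypothetical, and count_paths is only there to check the result using the running-sum idea described above.

```python
def build_slides(n, m):
    """Adjacency matrix of a DAG on n nodes with exactly m paths from
    node 0 to node n-1, or None if m is too large."""
    if m > 1 << (n - 2):
        return None
    adj = [[0] * n for _ in range(n)]
    # Fully connect everything except the last node, forwards only.
    for i in range(n - 1):
        for j in range(i + 1, n - 1):
            adj[i][j] = 1
    if m == 1 << (n - 2):
        # Need every link into the last node.
        for i in range(n - 1):
            adj[i][n - 1] = 1
    else:
        # Node j (1 <= j <= n-2) is worth 2^(j-1) paths, so link it to
        # the last node when bit j-1 of m is set (the "1 shifted" copy).
        for j in range(1, n - 1):
            if (m >> (j - 1)) & 1:
                adj[j][n - 1] = 1
    return adj


def count_paths(adj):
    """Count paths from node 0 to the last node, assuming forward edges."""
    n = len(adj)
    ways = [0] * n
    ways[0] = 1
    for j in range(1, n):
        ways[j] = sum(ways[i] for i in range(j) if adj[i][j])
    return ways[-1]


print(count_paths(build_slides(6, 11)))
```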

Q3) Determine how to repeatedly choose from 3 separate piles of distinct items such that you never choose the same set of three more than once, and that you never choose the same subset of two more than K times.

A3) It’s very tempting to just greedily choose the next possible combination while it doesn’t violate the conditions – but it can easily get stuck. This explains the large number of failing attempts on the small input, or so I think anyway.

To try and understand this problem I implemented the brute-force for the small input. Despite what it suggests in the contest analysis, it is possible to (with some care) write a brute-force solver which runs in time. You do have to be sure not to allocate any memory in the loop…

I didn’t really find my understanding of the problem significantly improved by being able to compare the basic greedy to the correct brute-force, but it did manage to reinforce my thought that greedy was the right technique, I just needed to tweak it to avoid getting stuck. So, I tried selecting from the third pile in reverse, alternating reverse every second time, and finally (I thought somewhat at random) by using a rotating offset driven by the selection of the first two piles.

This final option worked, much to my surprise – although now I’ve read the contest analysis it makes much more sense – the rotation clearly ensures it won’t get stuck.
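A small Python sketch of the rotating offset idea. It assumes the piles are ordered so that j <= p <= s, and it assumes the offset is simply the sum of the first two selections, which is one natural choice and not necessarily the exact offset I used in the contest. The is_valid helper is only there to check the two conditions.

```python
from collections import Counter


def fashion_outfits(j, p, s, k):
    """Pick outfits (a, b, c) from piles of size j <= p <= s."""
    outfits = []
    per_pair = min(s, k)  # each (a, b) pair can be used at most this often
    for a in range(j):
        for b in range(p):
            for d in range(per_pair):
                # Rotate the starting point in the third pile by a + b.
                outfits.append((a, b, (a + b + d) % s))
    return outfits


def is_valid(outfits, k):
    """No repeated triple, and no pair of items used more than k times."""
    if len(set(outfits)) != len(outfits):
        return False
    pairs = Counter()
    for a, b, c in outfits:
        pairs[("ab", a, b)] += 1
        pairs[("ac", a, c)] += 1
        pairs[("bc", b, c)] += 1
    return all(n <= k for n in pairs.values())


plan = fashion_outfits(1, 2, 3, 2)
print(len(plan), is_valid(plan, 2))
```

Because the third pile is at least as large as the other two, for a fixed offset step d each value in the third pile lines up with at most one item from either of the other piles, which is why no pair involving the third pile goes over the limit.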

GCJ16 R1B

So, with 1000 having already advanced there probably wasn’t quite so much competition this round, but it definitely seemed to be a slightly harder round, with only 320 perfect scores. Specifically the third problem large input was a bit difficult if you didn’t have experience with that kind of problem before.

In the end the cutoff was the first 2 problems in a decent time.

Q1) Given a set of characters made out of the letters in the English words for the digits zero through nine, determine what the original digits were, in non-descending order.

A1) This was a fairly straightforward, if potentially tedious to code, problem – you just have to identify the correct order to try removing the words from the count of each letter in order to avoid getting into a dead end. One such order is (Z)ERO, T(W)O, FO(U)R, SI(X), EI(G)HT, (O)NE, (T)HREE, (F)IVE, SE(V)EN, N(I)NE. Each bracketed letter does not occur in any word later in the list, so its count tells you exactly how many copies of that word remain, and removing in that order obviously works. I like this particular ordering because it’s the even digits followed by the odd digits, each in ascending order.
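As a sketch, the removal order above translates to Python fairly directly. The table layout and function name are just for illustration, and the input is assumed to be a valid scrambling of digit words.

```python
from collections import Counter

# (word, letter that identifies it at this point in the order, digit)
ORDER = [
    ("ZERO", "Z", 0), ("TWO", "W", 2), ("FOUR", "U", 4), ("SIX", "X", 6),
    ("EIGHT", "G", 8), ("ONE", "O", 1), ("THREE", "T", 3), ("FIVE", "F", 5),
    ("SEVEN", "V", 7), ("NINE", "I", 9),
]


def recover_digits(scrambled):
    """Return the original digits as a string in non-descending order."""
    letters = Counter(scrambled)
    found = []
    for word, key, digit in ORDER:
        copies = letters[key]  # every remaining `key` belongs to `word`
        for ch in word:
            letters[ch] -= copies
        found.extend([digit] * copies)
    return "".join(str(d) for d in sorted(found))


print(recover_digits("OZONETOWER"))  # ZERO + ONE + TWO
```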

Q2) Given a pair of numbers that have been obscured by replacing some of the digits with question marks, and may also be zero padded, determine the way to replace the question marks which results in the minimal difference, breaking ties by minimizing the first value and further by minimizing the second value.

A2) So the contest analysis mentions that this can be done in O(N), which I found interesting, but I’ve not managed to solve yet. The O(N^2) solution is to consider the replacement of question marks to have 3 phases. In Phase 1 you try to make the numbers equal, in Phase 2 you introduce a minimal single digit difference, and in Phase 3 you try to maximize the difference in the opposite direction to the difference introduced in Phase 2. You just iterate over all of the possible places for Phase 2, including not ever reaching it. For each possible Phase 2 location you can either try to make the second larger, or the first larger, but that only doubles the total number of passes, so it’s still O(N^2). You can try to optimize further by aborting when Phase 1 fails to make them equal, but it’s not trivial to know when not to bother trying with Phase 2 – I think you need some kind of pre-processing step to work that out in order to get to the O(N) solution.
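A sketch of the O(N^2) three phase approach in Python. It assumes both inputs are equal-length strings of digits and question marks, and the function names are invented for this example. Every Phase 2 position k is tried in both directions, with k equal to the length meaning Phase 2 is never reached.

```python
def _build(c, j, k, c_larger):
    """Fill the '?'s so the numbers agree before position k and split at k."""
    hi, lo = (c, j) if c_larger else (j, c)
    out_hi, out_lo = [], []
    for i, (h, l) in enumerate(zip(hi, lo)):
        if i < k:  # Phase 1: keep the digits equal, as small as possible
            if h == "?" and l == "?":
                h = l = "0"
            elif h == "?":
                h = l
            elif l == "?":
                l = h
            elif h != l:
                return None
        elif i == k:  # Phase 2: hi exceeds lo by as little as possible
            if h == "?" and l == "?":
                h, l = "1", "0"
            elif h == "?":
                if l == "9":
                    return None
                h = str(int(l) + 1)
            elif l == "?":
                if h == "0":
                    return None
                l = str(int(h) - 1)
            elif h <= l:
                return None
        else:  # Phase 3: pull the two back together
            if h == "?":
                h = "0"
            if l == "?":
                l = "9"
        out_hi.append(h)
        out_lo.append(l)
    a, b = "".join(out_hi), "".join(out_lo)
    return (a, b) if c_larger else (b, a)


def close_match(c, j):
    """Best (first, second) pair: smallest gap, then first, then second."""
    best = None
    for k in range(len(c) + 1):
        for c_larger in (True, False):
            cand = _build(c, j, k, c_larger)
            if cand is None:
                continue
            key = (abs(int(cand[0]) - int(cand[1])), cand[0], cand[1])
            if best is None or key < best:
                best = key
    return best[1], best[2]


print(close_match("?5", "?0"))
```

Comparing the filled strings directly is safe for the tie-breaks because they all have the same length, so string order matches numeric order.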

Q3) Given ordered pairs of words, determine the maximum subset which can be made entirely out of the first and second word sets of the inverse of that subset.

A3) I immediately recognized this as a graph problem. I mistakenly first assumed that the answer was the number of edges minus the minimum spanning tree, but the minimum spanning tree was the wrong concept. Not too long after that I realized it was a bipartite graph problem. With that I simply assumed that TMD.Algo had the solution and that the answer was the number of edges minus something to do with calculating the bipartite matching. In the end it’s the number of edges minus the size of the bipartite maximum independent set. Which doesn’t make a lot of sense at first glance, since you want the number of edges minus the minimum edge cover size – but for a bipartite graph (with no isolated vertexes, as here) the maximum independent set is the same size as the minimum edge cover. The minimum edge cover size is equal to the size of the maximum matching plus one for each vertex not in the maximum matching. The number of vertexes not in the maximum matching is the number of vertexes minus twice the maximum matching. Add on the size of the maximum matching and you get the number of vertexes minus the maximum matching, which is the same formula as the size of the maximum independent set.
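To make the formula concrete, here is a Python sketch that computes edges minus (vertexes minus maximum matching). It uses a plain augmenting path matching as a stand-in, not the TMD.Algo implementation, and the function name is made up. First words and second words are treated as separate vertex sets even when the same word appears in both positions.

```python
def max_fakes(pairs):
    """Largest number of pairs that could be built from words already used."""
    firsts = sorted({a for a, _ in pairs})
    seconds = {b for _, b in pairs}
    adj = {a: [] for a in firsts}
    for a, b in pairs:
        adj[a].append(b)

    match = {}  # second word -> first word currently matched to it

    def try_augment(a, seen):
        # Simple recursive augmenting path search; fine for small inputs,
        # but deep recursion may need care for large ones.
        for b in adj[a]:
            if b in seen:
                continue
            seen.add(b)
            if b not in match or try_augment(match[b], seen):
                match[b] = a
                return True
        return False

    matching = sum(1 for a in firsts if try_augment(a, set()))
    vertexes = len(firsts) + len(seconds)
    min_edge_cover = vertexes - matching
    return len(pairs) - min_edge_cover


print(max_fakes([("HYDROCARBON", "COMBUSTION"),
                 ("QUAIL", "BEHAVIOR"),
                 ("QUAIL", "COMBUSTION")]))
```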