TCO13 R1A

So, pre challenge phase I was in 531st, with the feeling that things could go either way for me. The 500pt had the opportunity for lots of failures because of floating point error margins, which I was confident I had taken care of. On the other hand, a small missed optimisation I noticed with 15 seconds to go left me uncertain whether it was possible to write a degenerate test case which would cause a time-out.

After challenge I was 480th, with a bit more confidence having seen someone successfully challenge 2 others (but not my solution) using what I am pretty sure would have been a time-out attempt.

My solution to the 500pt did fail system tests, but in spite of that my final placing was 471st.  There were a very large number of failures for the 500pt problem.  In my case it was a degenerate performance test case, as I expected.

471st gets me through to round 2, so no need to slog through R1B and/or R1C.  R2A/B/C happen after the daylight savings change, so I think they will be at 2am instead of 4am.  I am not sure which is worse really…

The 250pt question was easy, and really so was the 500pt – but it took me over half an hour to realise the obvious aspect of the 500pt problem I was missing.  I blame having to code at 4:30am…  The 1000pt was not so easy – but still there were ~80 submissions for it.

Q1) Given a set of numbers in the range 0-9, what is the minimum cost (where it costs 1 per unit of distance a number is moved) to ensure all the numbers are within 1 of each other?

A1) The limited range made brute force trivial – just try each pair of adjacent numbers (0,1), (1,2) … (8,9) and calculate the cost to move every number below the pair up to the lower value and every number above it down to the higher value.  I managed to score 240 out of 250 points here, which is pretty high for me – but it was a pretty easy problem…  The competition advancement cut-off was 234 points, so I couldn't afford to be much slower!
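
For illustration, here is a minimal C# sketch of that brute force – not my contest code, and the class and method names are just for this example:

```csharp
// A minimal sketch of the brute force described above.
using System;

static class MinAdjustCost
{
    // numbers: values in 0..9; returns the minimum total distance moved so that
    // every value ends up equal to b or b+1 for some b.
    public static int Solve(int[] numbers)
    {
        int best = int.MaxValue;
        for (int b = 0; b <= 8; b++)                     // try every adjacent pair (b, b+1)
        {
            int cost = 0;
            foreach (int x in numbers)
            {
                if (x < b) cost += b - x;                // move up to the lower value
                else if (x > b + 1) cost += x - (b + 1); // move down to the higher value
            }
            best = Math.Min(best, cost);
        }
        return best;
    }

    static void Main()
    {
        Console.WriteLine(Solve(new[] { 0, 5, 9 }));     // prints 8
    }
}
```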

Q2) Given a set of ‘gaps’ which start and end on integer boundaries (but are exclusive of the boundaries themselves), determine the smallest number such that no multiple of that number falls into any of these gaps.

A2) I saw several solutions to this, and many of them failed.  Most of them used floating point numbers for most of the work, which was a big risk due to cumulative errors.  Others failed (like mine) because the code didn’t scale well enough.

The key point, which took me far too long to realise, was that any candidate solution can be improved so long as none of its multiples lies exactly on the end point of a gap – if no multiple touches an end point, a microscopic reduction in jump size will still leave every multiple not hitting a gap.

This leads to the solution that the number is a divisor of one or more of the end values of a gap.  So for each end value we check each plausible divisor to see if it works for the entire set of gaps.  You can set an upper bound for the divisor by calculating the floor of the right edge divided by the width – but if the width is 1, this upper bound doesn’t help much.  When checking the divisors I kept everything in integers, basically doing simple fraction maths rather than using floating point.  However, I didn’t notice that my cost of checking whether a divisor worked wasn’t O(N), where N is the number of gaps, but instead O(D), where D is the maximum distance from the start to a right gap edge.  Since the upper bound for D was 30k and N was 50, this was a significant performance penalty.  The algorithm is O(N^2 D) if implemented correctly, but mine was instead O(N D^2).  O(N^2 D) was fast enough to pass tests.  I saw an O(ND log ND) implementation which passed, but I don’t actually understand why it works…

The mistake I made, which caused the O(D) performance when checking whether a specific length worked, was that I incremented the multiple counter by 1 while the current location was still in front of the current gap.  If the first gap was very late and the length under consideration was 1, it would tight-loop that counter all the way up.  I needed to either use a simple division short circuit to jump straight to the greatest multiple before the current gap, or use a modulus check to see whether the jump size would land in the gap, rather than walking my way up at all.
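
To make that concrete, here is a small integer-only sketch of the per-candidate check using the division short circuit.  The candidate jump size is represented as a fraction num/den (e.g. a gap end value divided by a multiple count); the name JumpSizeWorks and the exact boundary handling are illustrative choices, not the contest code:

```csharp
// Checks whether any multiple of num/den lands strictly inside any gap (left, right).
// All values are assumed non-negative with num, den > 0; everything stays in integers.
using System;

static class GapCheck
{
    public static bool JumpSizeWorks(long num, long den, (long Left, long Right)[] gaps)
    {
        foreach (var (left, right) in gaps)
        {
            // Smallest multiple m with m * num/den > left, i.e. m * num > left * den.
            long m = (left * den) / num + 1;
            // If that multiple is still strictly less than the right edge, it is inside the gap.
            if (m * num < right * den) return false;
        }
        return true;
    }
}
```

This is the O(N)-per-candidate check; my quadratic-in-D behaviour came from replacing that single division with a walk up one multiple at a time.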

Q3) Given a rectangular grid which is cyclic in nature (left edge joins to right edge, and top edge joins to bottom edge) where each cell in the grid specifies a direction (either left, right, up or down), determine the minimum number of cell changes required such that every cell is part of a cyclic path which comes back to itself.

A3) This problem is a tough one. (The system tester thought so as well – it refused to rate anyone on this problem the first time round :P)  I heard one competitor saying it was a case of min cost max flow, and I think I see how that could be the case, but min cost max flow is a very complicated algorithm to try to implement during a competition.

The first level of analysis on this problem is pretty straightforward – if every cell has another cell pointing to it, then every cell is in a cycle.  Correspondingly, every cell which has multiple cells pointing to it has to have changes made to reduce that to one.  The problem is how to minimise the number of changes required to redirect each excess input to a location with insufficient inputs.
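
A quick sketch of that first-level analysis – counting how many cells point at each cell of the cyclic grid.  The 'L'/'R'/'U'/'D' encoding is just an assumption for this example:

```csharp
// Cells with in-degree 0 need an arrow redirected to them; cells with in-degree
// greater than 1 have excess arrows that must be redirected elsewhere.
static class GridAnalysis
{
    public static int[,] InDegrees(char[][] grid)
    {
        int h = grid.Length, w = grid[0].Length;
        var deg = new int[h, w];
        for (int r = 0; r < h; r++)
            for (int c = 0; c < w; c++)
            {
                int nr = r, nc = c;
                switch (grid[r][c])
                {
                    case 'L': nc = (c + w - 1) % w; break;
                    case 'R': nc = (c + 1) % w; break;
                    case 'U': nr = (r + h - 1) % h; break;
                    case 'D': nr = (r + 1) % h; break;
                }
                deg[nr, nc]++;    // grid[r][c] points at (nr, nc)
            }
        return deg;
    }
}
```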

A bit of thought has failed to help me construct the appropriate min cost max flow graph to solve the problem, but it is after 7am – it is time to get some more sleep…

Post sleep – I can see how to formulate this as a min cost circulation problem.  Each node is duplicated, with a minimum and maximum flow of 1 from copy 1 to copy 2 (and a cost of 0) – then each copy 2 connects to copy 1 of each of its neighbours with a maximum flow of 1 and a cost of 1 if it isn’t the neighbour the node currently points at (0 if it is).  One special node is added, the post-source, which connects to every copy 2 with a maximum flow of 1 and a cost of 0.  The source connects to the post-source with a maximum flow equal to the target cycle count and a cost of 0.  All copy 2’s connect to the sink with a maximum flow of 1 and a cost of 0.

The problem is then solved for each target cycle count between 1 and V/2, and the minimum of the minimum costs is the result.

The problem is that standard min-cost max-flow algorithms don’t support minimum flows.  One solution in this case, where minimum flow equals maximum flow, is to make it ‘cheaper’ to go through every minimum flow edge than through any path that doesn’t include them.  To do this we change the edges with minimum flow to have no minimum flow, but a negative cost whose magnitude is greater than the number of edges in the original graph – this ensures that the minimum cost solution will pass through every one of these edges.  In this case the shortest augmenting path algorithm has to be one which handles negative cost edges.
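
To make the construction from the last few paragraphs concrete, here is a rough C# sketch that just builds the edge list for one target cycle count, including the negative-cost replacement for the minimum flow edges.  The node numbering and the ‘big’ negative cost value are my own illustrative choices, and a min-cost flow/circulation solver which copes with negative edge costs still has to consume the result:

```csharp
// Builds (from, to, capacity, cost) edges for the cycle-cover formulation sketched above.
using System.Collections.Generic;

static class CycleCoverGraph
{
    public static List<(int From, int To, int Cap, int Cost)> Build(char[][] grid, int targetCycles)
    {
        int h = grid.Length, w = grid[0].Length, n = h * w;
        int Copy1(int r, int c) => r * w + c;        // "copy 1" of each cell
        int Copy2(int r, int c) => n + r * w + c;    // "copy 2" of each cell
        int source = 2 * n, postSource = 2 * n + 1, sink = 2 * n + 2;

        var edges = new List<(int From, int To, int Cap, int Cost)>();
        int big = -(8 * n + 10);   // magnitude larger than the total number of edges

        // source -> post-source, capacity = target cycle count, cost 0
        edges.Add((source, postSource, targetCycles, 0));

        var dirs = new (char Dir, int Dr, int Dc)[] { ('U', -1, 0), ('D', 1, 0), ('L', 0, -1), ('R', 0, 1) };
        for (int r = 0; r < h; r++)
            for (int c = 0; c < w; c++)
            {
                // the min-flow-1 edge, replaced by a large negative cost as described above
                edges.Add((Copy1(r, c), Copy2(r, c), 1, big));
                edges.Add((postSource, Copy2(r, c), 1, 0));   // post-source -> copy 2
                edges.Add((Copy2(r, c), sink, 1, 0));         // copy 2 -> sink

                // copy 2 -> copy 1 of each cyclic neighbour; free if it is the direction
                // the cell already points at, cost 1 if the cell would have to be changed
                foreach (var (dir, dr, dc) in dirs)
                {
                    int nr = (r + dr + h) % h, nc = (c + dc + w) % w;
                    edges.Add((Copy2(r, c), Copy1(nr, nc), 1, grid[r][c] == dir ? 0 : 1));
                }
            }
        return edges;
    }
}
```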

One refinement – not sure if I’ve considered this before or am just starting to understand it properly for the first time – in this case you can do all of the V/2 problems at once by setting the target cycle count to infinity and keeping track of the cost after each augmenting flow.  Since each augmentation is minimum cost and increases the flow by 1, it passes through every cycle count and its associated minimum cost.

Surprise top coder…

So it appears I wasn’t paying enough attention to my email back in the middle of last year.  Unlike usual, when top coder announces that registration is opening a couple of months before the contest starts, this time it decided to open registration in the middle of the year before!

So I got a reminder email today: registration is open and, surprise, round 1A is this weekend.  Guess I will need to quickly muster my ability to stay awake and conscious at 3am once more.  I should probably start trying to do these competitions in C++, but with this little notice I think I will be using C# again this year.

Since I had been surprised by the sudden appearance of the top coder open, I went and double-checked – GCJ13 registration doesn’t open for another 3 and a half weeks.  Not that I will be eligible to compete this time!

Finding a gap in an ordered sequence of distinct integers presented in random order

Unrelated to my recent interviews at Google, I saw the following interview-style example question.

If you are given a random sequence of at most 4 billion 32-bit integers, find a 32-bit integer that is not used. (There must be one – why?)  Answer this in 2 parts: first assuming you have unlimited memory, and then assuming you have limited memory but plenty of disk storage.

This problem presents itself to me as a case of the pigeonhole principle and the associated ‘counting sort’.  32-bit integers have 2^32 possible values, which is a bit less than 4.3 billion – hence by the pigeonhole principle, 4 billion pigeons cannot fill 4.3 billion pigeonholes.  Unlimited memory means you can allocate the 16GiB needed for a proper counting sort – but since we don’t actually care about repetition counts in answering the question of which number is missing, you can instead allocate a 512MiB bitset, simply mark every observed number as a 1 bit, then search said bitset for a 0 bit.  And if your disk seek times are not a huge disaster, the bitset can be stored on disk instead.  This solution is O(N) in a sense, if you ignore the cost of allocating the bitset.  Obviously the original limit on the size of the input technically makes it O(1) with a massive constant – but that is a pretty non-useful viewpoint.
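
A minimal C# sketch of that bitset version – streaming the input as an IEnumerable is just a convenience for the example:

```csharp
// Finds a 32-bit value that never appears in the input. With fewer than 2^32 inputs
// the pigeonhole principle guarantees one exists. The 512 MiB bit set dominates memory.
using System.Collections.Generic;

static class MissingInt
{
    public static uint FindMissing(IEnumerable<uint> values)
    {
        var bits = new ulong[1 << 26];                 // 2^32 bits = 512 MiB
        foreach (uint v in values)
            bits[v >> 6] |= 1UL << (int)(v & 63);      // mark value v as seen
        for (uint i = 0; ; i++)                        // guaranteed to return by pigeonhole
            if ((bits[i >> 6] & (1UL << (int)(i & 63))) == 0)
                return i;
    }
}
```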

Usually I would probably have stopped there and not considered the question further (except to mention that an O(N log N) solution can be done using sequential disk access, and in this case log N is at most 32, so sequential read/write of integers vs random access at the 1-bit size is almost certainly going to make the O(N log N) solution perform better in practice) – but for some reason the fact that we have the fixed ‘constant’ time of allocating the bitset was sticking with me.  This morning I came up with an O(N) time algorithm to find a gap in an ordered sequence of integers without repetition presented in random order – with no 32-bit restriction. (And as a bonus, it acts like the O(N log N) algorithm in many respects, so a restricted memory implementation would get the performance advantages of sequential access to disk.)  Obviously the original problem doesn’t have the no-repetition restriction, and removing duplicates in O(N) uses hashing – which gives random access again, and non-O(N) behaviour if the hashing is poor – but I think the resultant problem is still interesting.

The algorithm isn’t very complicated and is probably of limited use, but I don’t often come up with something I haven’t seen before.  It uses as its basis the O(N) running time minimum, maximum and median algorithms.  The O(N) median algorithm has a fairly high constant factor – but it is only needed if you want guaranteed worst case time bounds – see the bottom for how to adjust the algorithm to give randomised expected O(N) time with a much lower constant.

The algorithm is recursive, with inputs of an integer array and the min/max of the range to find the gap in.  In the original problem the min/max would have been 0 and 2^32-1 (for unsigned integers), but if you are explicitly interested in gaps you can start by setting min/max to the results of running minimum and maximum on the input array.

From there the algorithm is as follows.  If the min of the range is greater than the max of the range, return ‘no gaps’.  If the input array is empty, return the min of the range.  Otherwise find the min/max/median of the input array: if min > min of range, return min of range; if max < max of range, return max of range.  Otherwise filter the input array to values between min and median (exclusive) or between median and max (exclusive), whichever range has the lowest density (if equal, choose either), and recurse using this filtered array and its corresponding exclusive range.

Performance analysis is trivial – each recursion reduces the number of elements considered by half (since the median is the halfway point), and each step takes time proportional to the size of the current array – leading to the N + N/2 + N/4 + … worst case, which gives O(N).

Correctness depends on there being no duplicates – obviously whenever it returns something other than ‘no gaps’ it has found a gap, but it is the absence of duplicates which lets us say that it will only return ‘no gaps’ if there really are none.  Because there are no duplicates, any given range in the sorted sequence either has no gaps (which gives a density of 1) or it has gaps (which gives a density less than 1).  When we divide the sequence into the groups smaller/larger than the median and calculate the density of each, all we really care about is whether the density is 1 or not.  Since the no-gaps case always has a higher density than the gaps case, we will always recurse into a side which contains gaps – unless both densities are 1. (Indeed, a trivial extension of the algorithm above is that if both densities are 1, return ‘no gaps’ rather than recursing.)  If we do recurse into a side which has no gaps, we will keep recursing until the input contains 2 or fewer consecutive numbers – the next recursion will then consist of an empty array and a max less than min, which is the case where we return ‘no gaps’.

The choice of the lower density side is simply an average time performance improvement – if the input set is biased, the lower density section is more likely to terminate in fewer recursions.  Technically that step can just be ‘choose a side whose density is not 1’.

As an addendum: if guaranteed worst case O(N) is not required, this algorithm is a bit slow (because of the cost of the guaranteed O(N) median finding algorithm).  You can drop the median requirement entirely and just choose a random element as the pivot to filter the array.  This still gives an expected running time of O(N) thanks to the randomisation – but it can degenerate, and in the worst degenerate scenario you will need to deal with an extra case of a 0 length filtered array with a 0 width range, which should be treated as if it has a density of 1.
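
Putting the randomised-pivot variant together, here is a rough C# sketch.  The names, and the use of long so the full unsigned 32-bit range fits, are just choices for this example:

```csharp
// Returns a value in [lo, hi] absent from the (duplicate-free) input, or null if the
// input covers [lo, hi] completely.
using System;
using System.Collections.Generic;
using System.Linq;

static class GapFinder
{
    static readonly Random Rng = new Random();

    public static long? FindGap(IReadOnlyList<long> values, long lo, long hi)
    {
        if (lo > hi) return null;                 // empty range: nothing to find
        if (values.Count == 0) return lo;         // empty input: everything is a gap

        long min = values.Min(), max = values.Max();
        if (min > lo) return lo;                  // lo itself is missing
        if (max < hi) return hi;                  // hi itself is missing

        long pivot = values[Rng.Next(values.Count)];

        var left  = values.Where(v => v > lo && v < pivot).ToList();
        var right = values.Where(v => v > pivot && v < hi).ToList();
        long leftSlots  = Math.Max(0, pivot - lo - 1);   // integers strictly between lo and pivot
        long rightSlots = Math.Max(0, hi - pivot - 1);   // integers strictly between pivot and hi

        // A side with fewer elements than slots must contain a gap; a full side
        // (count == slots, including the 0/0 degenerate case) has density 1 and is skipped.
        if (left.Count < leftSlots && right.Count < rightSlots)
            // prefer the lower-density side (purely a performance heuristic, see the text)
            return (double)left.Count / leftSlots <= (double)right.Count / rightSlots
                ? FindGap(left, lo + 1, pivot - 1)
                : FindGap(right, pivot + 1, hi - 1);
        if (left.Count < leftSlots) return FindGap(left, lo + 1, pivot - 1);
        if (right.Count < rightSlots) return FindGap(right, pivot + 1, hi - 1);
        return null;                              // both sides full: no gaps anywhere
    }
}
```

Calling FindGap with either the array’s own minimum/maximum, or with 0 and 2^32-1 for the original problem, gives a missing value; null means the range is completely covered.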

The not so Random random

Just a quick one here – while I was working on some graphs for a suggestion I made for LOTRO, I discovered a peculiar anomaly in Random.NextDouble.

Plotting a graph of the lengths of runs where NextDouble returns a value greater than or equal to 0.01, there is a significant dip at a run length of about 53.  Switching to RNGCryptoServiceProvider makes this dip disappear, and the dip is clearly reproducible even with sample sizes in the hundreds of millions.
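
For anyone wanting to reproduce it, something along these lines is the kind of measurement involved.  The sample count, the bucket cap, and the way I turn crypto bytes into a double are all arbitrary choices for the sketch, not the code I used for the graphs:

```csharp
// Histogram of run lengths for which NextDouble() stays >= 0.01, for System.Random
// versus RNGCryptoServiceProvider.
using System;
using System.Security.Cryptography;

static class RunLengths
{
    static double NextCryptoDouble(RandomNumberGenerator rng)
    {
        var buf = new byte[8];
        rng.GetBytes(buf);
        // 53 random bits -> uniform double in [0, 1)
        return (BitConverter.ToUInt64(buf, 0) >> 11) * (1.0 / (1UL << 53));
    }

    static void Main()
    {
        const int samples = 10_000_000;
        const int maxRun = 200;
        var hist = new long[2, maxRun + 1];
        var sysRng = new Random();
        using var cryptoRng = new RNGCryptoServiceProvider();

        for (int source = 0; source < 2; source++)
        {
            int run = 0;
            for (int i = 0; i < samples; i++)
            {
                double d = source == 0 ? sysRng.NextDouble() : NextCryptoDouble(cryptoRng);
                if (d >= 0.01) run++;
                else { hist[source, Math.Min(run, maxRun)]++; run = 0; }   // run ended
            }
        }

        for (int len = 0; len <= maxRun; len++)
            Console.WriteLine($"{len}\t{hist[0, len]}\t{hist[1, len]}");
    }
}
```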

Quite bizarre – although not quite as bizarre as this deviation mentioned on Stack Overflow.