Unrelated to my recent interviews at Google, I saw the following interview style example question.
If you are given a random sequence of at most 4 billion 32bit integers find a 32bit integer that is not used. (There must be one, why?) Answer this in 2 parts – if you have unlimited memory, and if you have limited memory but plenty of disk storage options.
This problem presents itself to me as a case of the pigeon hole principle and the associated ‘counting sort’. 32bit integers have 2^32 options, which is a bit less than 4.3 billion – hence by the pigeon hole principle, 4 billion pigeons cannot fill 4.3 billion pigeon holes. Unlimited memory means you can allocate the 16GiB needed for a proper counting sort – but since we don’t actually care about repetition counts in answering the question of what number is missing – you can instead allocate a 512MiB bitset and simply mark every observed number as a 1 bit, then search said bitset for a 0 bit. And if your disk seek times are not a huge disaster, the bit set can be stored on disk instead. This solution is O(N) in a sense, if you ignore the cost of allocating the bitset. Obviously the limitation on size of input originally technically makes it O(1) with a massive constant – but that is a pretty non-useful viewpoint.
Usually I would have probably stopped there and not considered the question further (Except to mention that an O(N log N) solution can be done using sequential disk access and in this case log N is at most 32, so sequential read/write of integers vs random access at the 1 bit size is almost certainly going make the O(N log N) solution perform better in practice) – but for some reason the fact that we have the fixed ‘constant’ time of allocating the bitset was sticking with me. This morning I came up with an O(N) time algorithm to find a gap in the ordered sequence of integers without repetition presented in random order – with no 32bit restriction. (And as a bonus, it acts like the O(N log N) algorithm in many respects, so a restricted memory implementation would get the performance advantages of sequential access to disk.) Obviously the original problem doesn’t have the restriction of no repetition and removing duplicates in O(N) uses hashing – which gives random access again and non O(N) behaviour if hashing is poor – but I think the resultant problem is still interesting.
The algorithm isn’t very complicated and probably of limited use, but I don’t often come up with something I haven’t seen before. It uses as its basis the O(N) running time minimum, maximum and median algorithms. The O(N) median algorithm has a fairly high running time constant of implementation – but it is only needed to be used if you want guaranteed worst case time bounds – see at the bottom for how to adjust the algorithm to give randomised expected O(N) time with a much lower running time constant.
The algorithm is recursive with inputs of ‘integer array’ and min/max of range to find the gap in. In the original problem min/max would have been 0 and 2^32-1 (for unsigned integers), but if you are explicitly interested in gaps you can start with setting min/max to the results of running minimum and maximum on the input array.
From there the trivial algorithm is as follows – if min of range is greater than max of range – return ‘no gaps’. If input array is empty – return min of range. Otherwise find min/max/median of input array – if min > min of range – return min of range – if max < max of range – return max of range. Otherwise filter input array to values between min and median (exclusive) or median and max (exclusive), which ever range is has lowest density – if equal choose one. Then recurse using this filtered array and its corresponding exclusive range.
Performance analysis is trivial – each recursion reduces the number of elements considered by half (since the median is the halfway point), and each step takes time proportional to the size of the current array – leading to the N+ N/2 + N/4 … worst case which gives O(N).
Correctness depends on there being no duplicates – obviously whenever it returns something other than ‘no gaps’ it has found a gap, but it is the no duplicates which lets us say that it will only return ‘no gaps’ if there are no gaps. Because there are no duplicates, any given range in the sorted sequence either has no gaps (which gives a density of 1) or it has gaps (which gives a density less than 1). When we divide the sequence into the groups smaller/larger than the median and calculate the density of each – all we really care about is whether the density is 1 or not. Since the no gaps case always has a higher density than the gaps case, we will always recurse into a side which contains gaps – unless both densities are 1. (Indeed trivial extension of the original algorithm above is that if both densities are 1, return no gaps rather than recursing.) If we do recurse into a side which has no gaps, we will keep recursing until the input contains 2 or less consecutive numbers – the next recurse will then consist of an empty array and a max less than min – which is the case we return no gaps.
The choosing of lower density is simply an average time performance improvement – if the input set is biased, the lower density section is more likely to terminate in less recursions. Technically that step can just be ‘choose a side with density which is not 1’.
As an addendum – this algorithm is a bit slow if guaranteed worst case O(N) is not required (because of the cost of the guaranteed O(N) median finding algorithm) – you can drop the median requirement entirely, and just choose a random element as the pivot to filter the array. This still gives expected running time of O(N) based on the randomisation – but it can degenerate – and in the worst degenerative scenario you will need to deal with an extra case of a 0 length filtered array with 0 width range – which should be treated as if it has a density of 1.
having just completed the Stanford algorithms course part 1 and 2 on coursera.org, I can proudly say I almost understand this post!
I can almost understand it as well – my writing is pretty terrible.
Also – I think my favourite answer to the original question for the memory restricted scenario is now ‘choose 100 random integers’ – check if they are missing. If not, repeat.
100 random integers in local cache, cost of checking them is not huge compared to the cost of streaming each integer in from storage. And the probability that 100 random 32bit integers will all be in a list of at most 4 billion 32bit integers is < 0.09%. So while technically this algorithm may never complete - in practice it is O(N) with a pretty low cost of implementation.