HashCodes

Ahh, hashcodes … such wonderful things.

I was recently reminded of how poorly they are used, and then further reminded of how poorly they are used when I then subsequently used them incorrectly while trying to fix the issue. Which lead me on a hunt, for the perfect hashcode combiner function.

I quickly discovered that there are several categories of hashcode combination:

  • Hash codes for structured objects or ordered lists.
  • Hash codes for a set of objects (no duplication).
  • Hash codes for unordered lists (which contain duplication).

So, for the first case theres a large variety of possibilities to consider. The System.Web assembly contains an internal class which combines hash codes using ((a<<5)+a)^b - which is simply (33*a)^b a rather vague looking formula which is somewhat reminiscent of a*x+b mod c - congruential random number generators. In this case the mod is 2^32, which isn't prime - but at least its very very fast to calculate 😉 - all in all this is an acceptable method of combining hashcodes - it doesn't suffer from the pure xor aproach's primary defect which is that if a==b the output is always 0, its fast at 3 single clock cycle instructions required, but it doesn't do all that much to compensate for bad quality input hashcodes. Both of the other two circumstances are unordered which means the hash code combiner function f(a,b) must satisfy f(a,b)==f(b,a). But further more it must satisfy f(f(a,b),c)==f(a, f(b,c)). This second condition is quite restrictive. If additionally we decide that each possible output must occur the same number of times (that is, the hashcode function isn't biased) the possibilities for the function are very restricted. For the domain of 2bit numbers there exists 2^32 functions which take two parameters. Only 16 of these satisfy the reordering and non-biased requirements. Note that I distinguished between sets and unordered lists. My primary reasoning here is that for sets, xor is pretty much perfect, the a==b case where two of the hash codes which are provided are the same only occurs during a hash-code collision in the input which is much rarer than in the unordered list with deliberate duplicates. XOR is a single clock cycle instruction, which means you won't get a hashcode faster. So as long as your input hashcodes aren't too bad, XOR is probably your best choice for calculating the hashcode of a set. This leaves the unordered list case - much more evil I tell you. My search for the perfect unordered list hashcode combiner continues, but I thought I might share some bits of research I've found along the way. Given two inputs a and b which are n-bit numbers - you can break them into n 1-bit pairs a1 b1 a2 b2 ..., where no reordering has occured, the most significant bit of a is paired with the most significant bit of b. There exists unordered hashcode combination functions which can work on groups of these 1-bit pairs. If a function operates on a single 1-bit pair, we shall call that a 1 bit combiner function, if it operates on 2 1-bit pairs then we shall call it a 2-bit combiner function. Some 2-bit combiner functions are actually just 2 1-bit combiner functions applied to each bit, but the majority are more complex. Given the concept of breaking things down, you can construct your own hashcode function for n-bits by grouping the n-bits into 1,2,3,... bit ordered lists and applying 1,2,3...bit combiner functions of your choice to each clump. With that introduction done. 1bit combiner functions: There are only 2 which satisfy all the requirements.

  • XOR
  • NOT XOR

2-bit combiner functions: There are 16 which satisfy all the requirements, if you consider order of the 1-bit pairs, important. If you don’t there are less.
First are the ones based off the 1-bit combiner functions.

  • XOR, XOR
  • XOR, NXOR – this can be reversed by changing order of bits.
  • NXOR, NXOR

The rest are new.

  • 2 bit unsigned addition ignoring overflow – upside down addition can be performed by changing order of bits.
  • 2 bit unsigned addition ignoring overflow + 1,2,3 – offset isn’t actualy very useful, but it does change the output – upside down addition can be performed by changing order ofbits.

I don’t have a name for the final type of operator – changing order of bits doesn’t do anything because its symmetrical. It can be xor’d with 1 or 2 or 3. The basic gist however is if one of the arguments is 0 or 3 it behaves like XOR, otherwise it behaves like equality, returns 3 if inputs are the same, 0 if inputs are different. If the inputs are a and b and aRot is input a with its bits switched and bRot is input b with its bits switched then logically the formula is as follows. I am hoping this can be reduced, but right now, thats beyond my puny little brain.

((~(a ^ aRot) | ~(b ^ bRot)) & (a ^ b)) | (((a ^ aRot) & (b ^ bRot)) & (a ^ bRot | b ^ aRot)) – the final b ^ aRot term is unneeded, as it has the same result as a ^bRot in that context, but its included to make the symmetry obvious.

3 bit combinators I haven’t even started on, but I’m sure there is a wealth of complex forms to be found above all the possible combinations of 1 and 2 bit combinators.