A new release

TMD.Algo 0.0.2.0 finally escaped.

You can get it here.

There is a bunch of new stuff; however, not all of it has been tested yet, so as usual, use with caution.

The graph class is especially ultra-alpha right now.

One final note: this is a source code and binary release, but it won't actually compile out of the box.
If you want to make it compile, there are several things to do.

  1. Create a private key to sign your custom builds (or remove the key signing).
  2. Update the InternalsVisibleTo on TMD.Algo so the test cases can see the ugly internal methods.
  3. Install NUnit 2.4.8 (or maybe a more recent version will work).
  4. Install FxCop 1.36 into the default location (or modify the post build step for TMD.Algo).

Hopefully that covers it; I haven't actually tried these steps myself.

Generics and .Net internals

I noticed the following method while on a journey through Reflector the other day.

RuntimeTypeHandle.CreateInstanceForAnotherGenericParameter(Type otherType)

Quite an odd method really.

Example in use:
GenericEqualityComparer.TypeHandle.CreateInstanceForAnotherGenericParameter(typeof(T));

This lets you make a GenericEqualityComparer<T> instance.

My first thought was: why wouldn't you just write
new GenericEqualityComparer<T>()
instead?

As it turns out, it's because GenericEqualityComparer<T> has a constraint which says T : IEquatable<T>, while the generic type calling this code has no constraint on T.

So this method is really useful: it allows you to write a set of if statements checking what T implements and then defer processing to appropriate special-case generic types.

Unfortunately the above method is internal, so you would have to do the equivalent with a bunch of reflection, which is probably a bit slower too.
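
For illustration, here is a minimal sketch of the reflection-based equivalent. The EquatableComparerSketch<T> class and ComparerFactorySketch.Create<T>() method are made up for this example; they are not framework types.

using System;
using System.Collections.Generic;

// Analogue of GenericEqualityComparer<T>: only constructible when T : IEquatable<T>.
class EquatableComparerSketch<T> : IEqualityComparer<T> where T : IEquatable<T>
{
    public bool Equals(T x, T y) { return x == null ? y == null : x.Equals(y); }
    public int GetHashCode(T obj) { return obj == null ? 0 : obj.GetHashCode(); }
}

static class ComparerFactorySketch
{
    // No constraint on T here; the specialisation is picked at runtime instead.
    public static IEqualityComparer<T> Create<T>()
    {
        if (typeof(IEquatable<T>).IsAssignableFrom(typeof(T)))
        {
            // The public (slower) equivalent of CreateInstanceForAnotherGenericParameter.
            Type closed = typeof(EquatableComparerSketch<>).MakeGenericType(typeof(T));
            return (IEqualityComparer<T>)Activator.CreateInstance(closed);
        }
        return EqualityComparer<T>.Default;
    }
}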

I am thinking that we should petition Microsoft to provide a language construct which is syntactic sugar for this method.

var v = new GenericEqualityComparer<T>() : where T is IEquatable;
It could throw an InvalidOperationException if you haven't guarded for the case where T is not IEquatable.

In other news, 0.0.2.0 of TMD.Algo should be out soon. I have a deque, a couple more sorts (really basic ones), and an alternative heap which supports O(1) Contains and O(log n) Remove at the expense of some extra bookkeeping. This is useful for implementing Prim's algorithm, as the key-reduction step can be done in O(log n) (as needed) by simply removing the entry and adding a new one after changing the key.
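
For illustration, here is a minimal sketch of that key-reduction pattern, written against a hypothetical heap interface rather than the library's actual API:

// Hypothetical interface: a heap with the extra bookkeeping described above.
interface IRemovableHeap<T>
{
    void Add(T item);        // O(log n)
    bool Contains(T item);   // O(1) thanks to the bookkeeping
    void Remove(T item);     // O(log n)
}

static class PrimSketch
{
    // Decrease-key for Prim's algorithm, done as remove-then-add.
    static void DecreaseKey<T>(IRemovableHeap<T> heap, T oldEntry, T newEntry)
    {
        if (heap.Contains(oldEntry))
        {
            heap.Remove(oldEntry);  // O(log n)
        }
        heap.Add(newEntry);         // O(log n)
    }
}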

Google Code Jam 08

So, it's been a couple of years since I last entered the Google Code Jam. Back in 05 I came ~120th, and after a few cancellations it almost looked like I was going to go to the finals.

This year, however, the Google Code Jam has been run in a dramatically different format. No longer powered by TopCoder, everyone runs their code on their own machines. This lets Google's site scale a lot better, although it does add to the overhead of cheat detection.

Also new this year are regional finals: the top 500 compete at a 'nearby' Google office before the real finals in the US.

So in a couple of weeks I will be at Google Sydney, competing to be in the top 20% in APAC so I can go to the US. I am unsure if this will be tougher than top 100 worldwide, but at least the competition won't be at 2am like the last two rounds before it. Still, there are a lot of tough competitors in the APAC region.

I do have an incentive though, a friend of mine recently had his internet company bought by Google and he is working for YouTube now. He has offered a game of frisbee golf if I make it over there.

They offer job interviews to a subset of the people going. I was offered one but I turned it down. Several people have already commented that I am crazy for doing so. Maybe I am crazy, because I personally can't think of what I would do at Google.

Finally, however, the GCJ 08 terms and conditions allow for the downloading of publicly available libraries to assist in the writing of solutions during the competition. Specifically, any library used must be available to all other contestants as well.

I have started work on such a library, tailor-made for algorithm competitions, and over the next couple of weeks I intend to release it under a BSD or equivalent license via this website. The library will be for .Net, and I am going to target .Net 3.5 as VS2008 is available at the competition.

For starters, I intend to implement any data structure in CLR's Introduction to Algorithms which is not in the .Net base libraries. Additionally, I intend to provide any specific algorithm from this book which can be generalized. I am also going to see if extension methods can be applied to delegates, because I think that would be a nice way to write ternary/binary searches for minimums/thresholds.
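
As a rough sketch of that delegate idea (my own illustration with made-up names, not the library's code), here is a ternary search for the minimum of a unimodal function, written as an extension method on a delegate:

using System;

static class SearchExtensionsSketch
{
    // Finds (approximately) the x in [lo, hi] minimising f(x), assuming f is unimodal.
    public static double TernarySearchMin(this Func<double, double> f, double lo, double hi)
    {
        for (int i = 0; i < 200; i++)
        {
            double m1 = lo + (hi - lo) / 3.0;
            double m2 = hi - (hi - lo) / 3.0;
            if (f(m1) < f(m2))
                hi = m2;  // the minimum lies in [lo, m2]
            else
                lo = m1;  // the minimum lies in [m1, hi]
        }
        return (lo + hi) / 2.0;
    }
}

// Usage: Func<double, double> parabola = x => (x - 3.0) * (x - 3.0) + 1.0;
//        double xMin = parabola.TernarySearchMin(-100.0, 100.0);  // roughly 3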

I currently have an implementation of IList which is backed by a red-black tree for O(log n) indexed inserts. I also have another one which efficiently tracks the sum of all elements of the list prior to the current one (specifically, it is generalized based on delegates for 'addition' and 'subtraction', so as long as the operator in question can be 'undone' it can be any accumulating operation).

If I have time, I am going to investigate implementing 'flattened' versions of these data structures, which store indexes into an array of nodes and maintain a free list, rather than allocating the nodes directly on the heap and using pointers. This technique usually results in a performance improvement, especially in a garbage collected runtime. Additionally, if the array is expanded using the standard List expansion logic, the amortized additional cost on inserts is O(1). There is a memory penalty, although on a 64-bit OS, with a limitation of 2^32 nodes in the collection, the memory penalty is significantly lower due to using ints instead of 64-bit pointers. I am interested to see how a flattened linked list compares to the .Net runtime's built-in linked list.
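
As a rough illustration of what I mean by 'flattened' (a sketch only, with made-up names; not what will ship in the library):

using System;

// Nodes live in a growable array and reference each other by index rather than by pointer.
struct FlatNodeSketch<T>
{
    public T Value;
    public int Next;  // index of the next node, or -1
    public int Prev;  // index of the previous node, or -1
}

class FlatLinkedListSketch<T>
{
    private FlatNodeSketch<T>[] nodes = new FlatNodeSketch<T>[4];
    private int head = -1;
    private int highWater;      // number of array slots ever handed out
    private int freeHead = -1;  // head of the free list, threaded through Next

    private int Allocate()
    {
        if (freeHead >= 0)
        {
            int index = freeHead;               // reuse a previously removed slot
            freeHead = nodes[freeHead].Next;
            return index;
        }
        if (highWater == nodes.Length)
            Array.Resize(ref nodes, nodes.Length * 2);  // List-style doubling, amortised O(1)
        return highWater++;
    }

    public int AddFirst(T value)
    {
        int index = Allocate();
        nodes[index] = new FlatNodeSketch<T> { Value = value, Next = head, Prev = -1 };
        if (head >= 0) nodes[head].Prev = index;
        head = index;
        return index;
    }

    // Removal pushes the slot onto the free list instead of leaving garbage for the GC.
    public void Remove(int index)
    {
        if (nodes[index].Prev >= 0) nodes[nodes[index].Prev].Next = nodes[index].Next;
        else head = nodes[index].Next;
        if (nodes[index].Next >= 0) nodes[nodes[index].Next].Prev = nodes[index].Prev;
        nodes[index] = new FlatNodeSketch<T> { Value = default(T), Next = freeHead, Prev = -1 };
        freeHead = index;
    }
}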

About time to post again

It's been a while, but I have a couple of posts planned.

Firstly, I was recently reminded of a .Net ‘trick’ I learnt from a friend. We recently took this trick to a new level of ‘evil’.

Strings are immutable right… well sort of.

using System;
using System.Runtime.InteropServices;

string a = "a";
// Pin the string so we can get a raw pointer to its character data.
GCHandle handle = GCHandle.Alloc(a, GCHandleType.Pinned);
char[] newString = new char[] { 'b' };
// Overwrite the pinned string's first character in place.
Marshal.Copy(newString, 0, handle.AddrOfPinnedObject(), 1);
Console.Out.WriteLine(a);
Console.ReadKey();
handle.Free();

This program outputs the character b, all without actually using the unsafe keyword. The unsafe APIs are already there to use. (They all have an unmanaged code permission link demand on them.)

Taking things to the next level, however, allows for a bit of fun.

Examples include:
Environment.NewLine: change it to any two characters of your choice (see the sketch below).
Exception messages: cause the exception once, change the message string, and from then on it says whatever you like.
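
For example, a sketch of the Environment.NewLine variant; this assumes Environment.NewLine hands back the same cached string instance on each call (which it does on the .Net Framework, since it returns a string literal):

using System;
using System.Runtime.InteropServices;

string newLine = Environment.NewLine;  // normally the two characters "\r\n" on Windows
GCHandle handle = GCHandle.Alloc(newLine, GCHandleType.Pinned);
char[] replacement = new char[] { '!', '!' };  // any two characters of your choice
Marshal.Copy(replacement, 0, handle.AddrOfPinnedObject(), 2);
handle.Free();
Console.WriteLine("first line" + Environment.NewLine + "second line");
// Prints: first line!!second line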

With a bit of reflection, however, you can do a lot of damage. For instance, change every string which can be found by walking the object hierarchy starting from static fields.

Site Change

So, after a significantly long time of this site just sitting here, a copious amount of spam had accumulated in both the wiki and the phpBB forums. I've decided that, given there was a grand total of 3 articles in the wiki anyway, the best option is for me to just post stuff here. I've also removed the old blog posts, which were just me mumbling about what to do with the website.

In this incredibly long post I shall detail the only post from the wiki I wanted to keep visible to the public.

View State
ViewState in .Net 2.0 is encoded using the ObjectStateFormatter Class, which optimises the encoding for common cases. Knowing these common cases can help when deciding what to store in viewstate.

The viewstate object which is persisted is itself a complex object formed based on page content.

At the top level it is a Pair class with:

  • A HybridDictionary of control states, indexed by control ID (the control states themselves recursively contain subcontrol control states)
  • A pair containing:
    • Page.GetTypeHashCode().ToString(NumberFormatInfo.InvariantInfo)
    • Constructed ViewState

Constructed ViewState is either:

A Pair:

  • Page.SaveViewState() output (or Control.SaveViewState() if it's Constructed ViewState for a control)
  • ArrayList containing alternating entries of ControlID and Constructed ViewState for the control.

Or a Triplet:

  • Page.Adapter.SaveAdapterViewState() output (or Control.Adapter.SaveAdapterViewState()…)
  • Page.SaveViewState() output (or Control.SaveViewState() …)
  • ArrayList containing alternating entries of ControlID and Constructed ViewState for the control.

ObjectStateFormatter

Assembly: System.Web
Namespace: System.Web.UI
New in .Net 2.0

ObjectStateFormatter Format

The MSDN docs describe the format in a basic sense, but what follows is a more detailed description which may assist in the optimisation of viewstate for a page. For instance, the MSDN docs state that Enum is optimised, but as you can see in what follows, storing the enum's underlying integer value into the viewstate directly is much more efficient.

This format may change in future versions of .Net so keep that in mind.

The overall format consists of two leading bytes 0xFF 0x01 followed by encoded data as follows.

The basic format consists of a 'tag' which is 1 byte long, followed by data which is handled according to the tag. The data following a tag may itself contain nested content in this same format in some circumstances.
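
Several of the tags below use a '7 bit per byte' encoded integer (described under tag 0x02). As a rough sketch, that encoding works like the following, equivalent to BinaryWriter's Write7BitEncodedInt as far as I can tell:

using System.IO;

static class SevenBitSketch
{
    // Low 7 bits of each byte hold data; the top bit is set when another byte follows.
    // The least significant 7 bits are written first.
    public static void WriteEncoded(Stream output, uint value)
    {
        while (value >= 0x80)
        {
            output.WriteByte((byte)(value | 0x80));  // more bytes follow
            value >>= 7;
        }
        output.WriteByte((byte)value);               // top bit clear: last byte
    }

    public static uint ReadEncoded(Stream input)
    {
        uint value = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            value |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;       // top bit clear: done
            shift += 7;
        }
    }
}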

* Tag: 0x01
* Data: A short encoded using BinaryWriter Class
* Represents: a short.

* Tag: 0x02
* Data: Encoded integer. 7 bits are stored in each byte; the top bit is 0 when there are no more bytes. The first byte contains the least significant 7 bits, the next byte contains the next least significant 7 bits, and so on.
* Represents: a non-zero number.

* Tag: 0x03
* Data: A byte encoded using BinaryWriter Class
* Represents: a byte.

* Tag: 0x04
* Data: A char encoded using BinaryWriter Class
* Represents: a char.

* Tag: 0x05
* Data: A string encoded using BinaryWriter Class (This is UTF8)
* Represents: a non-empty string.

* Tag: 0x06
* Data: A long encoded using BinaryWriter Class
* Represents: A DateTime instance, where the long is the output from DateTime.ToBinary()

* Tag: 0x07
* Data: A double encoded using BinaryWriter Class
* Represents: a double.

* Tag: 0x08
* Data: A float encoded using BinaryWriter Class
* Represents: a float.

* Tag: 0x09
* Data: An int encoded using BinaryWriter Class.
* Represents: An instance of the Color struct. The encoded int is an ARGB value from Color.ToArgb().

* Tag: 0x0A
* Data: A 7 bit per byte encoded value (see above).
* Represents: A Color corresponding to a member of the KnownColor enumeration. Value indicates the value of the KnownColor to which the stored color corresponds.

* Tag: 0x0B
* Data: An encoded type (see Data definition for Tag 0x19) followed by a 7 bit per byte encoded value (see above).
* Represents: An instance of a typed enum which has a base type of int. The encoded type is the type of enum, and the value is the integer value of the enum instance.

* Tag: 0x0C
* Data: Empty (0 bytes)
* Represents: Color.Empty

* Tag: 0x0F
* Data: Two consecutive blocks encoded using this format.
* Represents: an instance of the Pair Class – first encoded block is the First property, second is the Second property

* Tag: 0x10
* Data: Three consecutive blocks encoded using this format.
* Represents: an instance of the Triplet Class – first is the First property, second is the Second property, third is the Third property

* Tag: 0x14
* Data: An encoded type (see Data definition for Tag 0x19) followed by a 7 bit per byte encoded count (see above) followed by ‘count’ blocks encoded using this format.
* Represents: A one-dimensional typed array which contains more than 25% non-null values. The encoded type represents the type of the array. The encoded count represents the length. The blocks represent the contents of the array encoded in order from 0th element to the length-1 element.

* Tag: 0x15
* Data: 7bit per byte encoded count (see above) followed by ‘count’ strings encoded with BinaryWriter Class (This uses UTF8).
* Represents: an instance of string[] which contains no null values.

* Tag: 0x16
* Data: 7bit per byte encoded count (see above) followed by ‘count’ encoded blocks.
* Represents: an instance of the ArrayList Class – not a class which inherits from ArrayList; each block represents a value stored in the ArrayList, in the order they are stored.

* Tag: 0x17
* Data: 7bit per byte encoded count (see above) followed by 2*’count’ encoded blocks.
* Represents: an instance of the Hashtable Class – not a class which inherits from Hashtable; each pair of blocks represents a key value pair from the Hashtable.

* Tag: 0x18
* Data: 7bit per byte encoded count (see above) followed by 2*’count’ encoded blocks.
* Represents: an instance of the HybridDictionary Class – not a class which inherits from HybridDictionary; each pair of blocks represents a key value pair from the HybridDictionary.

* Tag: 0x19
* Data: a tag byte followed by tag specific data.
* Represents: a type.
  • If the tag byte is 0x29, the following data is the full name string of a type in the System.Web assembly encoded using BinaryWriter Class.
  • If the tag byte is 0x2A, the following data is the assembly qualified name string of a type encoded using BinaryWriter Class.
  • If the tag byte is 0x2B, the following data is a 7bit per byte encoded integer index into the order of arrival of types at the encoder. Unlike IndexedString encoding, the counter is a full int and not just a byte, so there is no elaborate resetting system.
  • Subtags 0x29 and 0x2A are only used for the first occurrence of each specific type in the output stream. All subsequent references use the 0x2B subtag.

* Tag: 0x1A (not used by encoder??)
* Data: An encoded type (see Data definition for Tag 0x19) followed by a block encoded using this format.
* Represents: An instance of a nullable type. Implementation appears broken, as it uses reflection to access a FromObject method on the Nullable`1 Class which does not seem to exist. The encoded type is supposed to be the underlying type of the nullable instance, and the encoded block the current value of the nullable instance.

* Tag: 0x1B
* Data: A double encoded using BinaryWriter Class followed by an int encoded using BinaryWriter Class
* Represents: An instance of the Unit Class. The double corresponds to the value of the Unit, and the int corresponds to the enumeration value of the UnitType property of the Unit.

* Tag: 0x1C
* Data: Empty (0 bytes)
* Represents: Unit.Empty

* Tag: 0x1E
* Data: string encoded using BinaryWriter Class (This is UTF8)
* Represents: An occurrence of an IndexedString. Guaranteed to be different from the previous 256 occurrences encoded using this tag.

* Tag: 0x1F
* Data: a single byte.
* Represents: An IndexedString. Which IndexedString is determined by using the byte as a lookup into the current IndexedStrings array.

* Note: the above two tags are encoded as follows. A lookup is used which contains up to 255 entries. As each IndexedString arrives at the encoder, it is checked to see if it is in the lookup. If so, the index stored in the lookup is written using the 0x1F tag. Otherwise it is stored in the lookup as the next number, unless the next number is 255; if the next number is 255 it replaces the entry mapped to 0 in the lookup, and the next number is set to 1. The IndexedString is then stored using the 0x1E tag. This can be decoded by reading each 0x1E tag's string into a lookup and, when the lookup index reaches 255, replacing 0 then 1 then 2 etc. in turn. 0x1F is decoded by indexing into the lookup array.
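
A decoder-side sketch of that table maintenance, as I read the description above (illustration only, not decompiled code):

using System.IO;

class IndexedStringTableSketch
{
    private string[] table = new string[255];
    private int next;

    // Call with the tag byte already consumed; the reader is positioned at the tag's data.
    public string Read(BinaryReader reader, byte tag)
    {
        if (tag == 0x1F)
        {
            return table[reader.ReadByte()];  // previously seen string, looked up by index
        }
        string s = reader.ReadString();       // tag 0x1E: a new string
        if (next == 255) next = 0;            // wrap around, replacing 0, then 1, then 2, ... in turn
        table[next] = s;
        next++;
        return s;
    }
}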

* Tag: 0x28
* Data: An encoded type (see Data definition for Tag 0x19) followed by a string encoded using BinaryWriter Class (This is UTF8)
* Represents: An instance of an object which has a type converter which can convert to/from string. The encoded type represents the type of the object, and the string represents the encoding using TypeConverter.ConvertToInvariantString(Object).

* Tag: 0x32
* Data: A 7bit per byte encoded length (see above) followed by a byte array stored using BinaryWriter Class.
* Represents: An instance of an object which can be stored using BinaryFormatter. The length indicates the length of the byte array which follows, and the byte array itself represents the output of BinaryFormatter.Serialize(Stream, Object). A memory stream to hold this output during encoding is by default initialized to 256 bytes in length. Decoding uses a pool of 4k memory streams.

* Tag: 0x3C
* Data: An encoded type (see Data definition for Tag 0x19) followed by a 7 bit per byte encoded length followed by a 7 bit per byte encoded count followed by ‘count’ pairs of 7 bit per byte encoded indexes and data encoded using this format.
* Represents: a typed one dimensional array with a large number of null values (>75%). The encoded length represents the length of the array, and the encoded count represents the number of non-null values. The following pairs represent indexes into the array and the non-null values stored in them.
* Note: the encoder uses recursion at this point, so theoretically an excessively deep nested sparse array could cause a stack overflow. Extremely unlikely though.

* Tag: 0x64
* Data: Empty (0 bytes)
* Represents: null reference

* Tag: 0x65
* Data: Empty (0 bytes)
* Represents: empty string

* Tag: 0x66
* Data: Empty (0 bytes)
* Represents: The integer 0

* Tag: 0x67
* Data: Empty (0 bytes)
* Represents: boolean true

* Tag: 0x68
* Data: Empty (0 bytes)
* Represents: boolean false

Notes:
* Deserialization uses recursion, so a particularly nested data structure of supported types could cause a stack overflow (extremely unlikely).
* ArrayList takes less space to encode than int[], double[], or any basic-type array other than string[], due to the fully qualified assembly name for int, double, or whatever having to be encoded at least once in the output, and the type lookup index for any other uses.
* No support for generic collections. (Other than binary serialization/type converter)
* No support for multi-dimensional arrays. (Other than binary serialization/type converter)
* No support for nullable type instances.
* If you have an instance of SortedList, the best conversion is not to DictionaryEntry[]; an ArrayList with even indexes for the keys and odd indexes for the values will take much less space and perform better (see the sketch after these notes).
* IDictionary is not optimised despite being listed as such in the MSDN docs; only Hashtable and HybridDictionary instances of IDictionary are.
* The above overall format is potentially encrypted or appended with a keyed MAC depending on Page settings. Finally it is base64 encoded. It is not compressed.
* Guid is not a fundamental type in the encoding. Oddly enough, Guid.ToString("N") and Guid.ToByteArray() will encode in virtually the same space, but ToByteArray will have to store a type identifier at least once, and is longer even as the number of Guids approaches infinity. Therefore Guid.ToString("N") is the best simple encoding method – new ArrayList(Guid.ToByteArray()) is an equivalent-in-size alternative, but harder to reconstruct in load view state. (Another more complex alternative is to convert the ToByteArray output into a string using a better encoding than hex.)
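
To illustrate the SortedList note above, a minimal sketch of the kind of packing I mean (hypothetical helper names):

using System.Collections;

static class ViewStatePackingSketch
{
    // Pack a SortedList into an ArrayList of alternating key, value entries for viewstate.
    public static ArrayList Pack(SortedList source)
    {
        ArrayList packed = new ArrayList(source.Count * 2);
        foreach (DictionaryEntry entry in source)
        {
            packed.Add(entry.Key);
            packed.Add(entry.Value);
        }
        return packed;
    }

    // Rebuild the SortedList when loading view state.
    public static SortedList Unpack(ArrayList packed)
    {
        SortedList result = new SortedList(packed.Count / 2);
        for (int i = 0; i < packed.Count; i += 2)
        {
            result.Add(packed[i], packed[i + 1]);
        }
        return result;
    }
}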

Free Will and Magic

I was recently reading a few New Scientist articles and reader letters on the topic of Free Will and Consciousness. Always a popular topic which I usually file in the ‘why do I care’ basket. However my mind wandered to Arthur C Clarke’s third law of prediction ‘Any sufficiently advanced technology is indistinguishable from magic.’

Having thought of this I came up with a few statements of my own.

  1. Any sufficiently advanced recursive analysis system is indistinguishable from a conscious system.
  2. Any sufficiently advanced recursive analysis system with a blindfold layer is indistinguishable from a system with Free Will.
  3. Any sufficiently advanced recursive analysis system with an internal random source is indistinguishable from a system with Free Will.

I should probably explain the concept of a blindfold layer. A blindfold layer is a part of the analysis system which only provides inputs to the rest of the system; it is not recursively linked. This means that the consciousness which comes from recursive analysis cannot see what is happening in this additional section of input.

2 and 3 are linked via ‘Any sufficiently complicated chaotic system is indistinguishable from randomness’.

Finally, I conclude from my thoughts of the above postulates that ‘Any system of free will which is not based on 2 or 3 involves data sources which are neither random nor deterministic.’

So umm… yeah.

HashCodes

Ahh, hashcodes … such wonderful things.

I was recently reminded of how poorly they are used, and then further reminded of how poorly they are used when I subsequently used them incorrectly while trying to fix the issue. Which led me on a hunt for the perfect hashcode combiner function.

I quickly discovered that there are several categories of hashcode combination:

  • Hash codes for structured objects or ordered lists.
  • Hash codes for a set of objects (no duplication).
  • Hash codes for unordered lists (which contain duplication).

So, for the first case there's a large variety of possibilities to consider. The System.Web assembly contains an internal class which combines hash codes using ((a<<5)+a)^b, which is simply (33*a)^b: a rather vague-looking formula which is somewhat reminiscent of a*x+b mod c, the form of congruential random number generators. In this case the mod is 2^32, which isn't prime, but at least it's very very fast to calculate 😉. All in all this is an acceptable method of combining hashcodes: it doesn't suffer from the pure XOR approach's primary defect, which is that if a==b the output is always 0; it's fast, at 3 single-clock-cycle instructions; but it doesn't do all that much to compensate for bad-quality input hashcodes.

Both of the other two circumstances are unordered, which means the hash code combiner function f(a,b) must satisfy f(a,b)==f(b,a). Furthermore it must satisfy f(f(a,b),c)==f(a,f(b,c)). This second condition is quite restrictive. If additionally we decide that each possible output must occur the same number of times (that is, the combiner isn't biased), the possibilities for the function are very restricted. For the domain of 2-bit numbers there exist 2^32 functions which take two parameters (each of the 16 possible input pairs can map to any of the 4 outputs, giving 4^16 = 2^32). Only 16 of these satisfy the reordering and non-bias requirements.

Note that I distinguished between sets and unordered lists. My primary reasoning here is that for sets, XOR is pretty much perfect: the a==b case, where two of the provided hash codes are the same, only occurs during a hash-code collision in the input, which is much rarer than in an unordered list with deliberate duplicates. XOR is a single clock cycle instruction, which means you won't get a hashcode combiner faster. So as long as your input hashcodes aren't too bad, XOR is probably your best choice for calculating the hashcode of a set.

This leaves the unordered list case - much more evil I tell you. My search for the perfect unordered list hashcode combiner continues, but I thought I might share some bits of research I've found along the way.

Given two inputs a and b which are n-bit numbers, you can break them into n 1-bit pairs a1 b1, a2 b2, ..., where no reordering has occurred: the most significant bit of a is paired with the most significant bit of b. There exist unordered hashcode combination functions which work on groups of these 1-bit pairs. If a function operates on a single 1-bit pair, we shall call it a 1-bit combiner function; if it operates on two 1-bit pairs, we shall call it a 2-bit combiner function. Some 2-bit combiner functions are actually just two 1-bit combiner functions applied to each bit, but the majority are more complex. Given this concept of breaking things down, you can construct your own combiner function for n bits by grouping the n bits into 1,2,3,...-bit ordered clumps and applying 1,2,3,...-bit combiner functions of your choice to each clump.

With that introduction done, 1-bit combiner functions: there are only 2 which satisfy all the requirements.

  • XOR
  • NOT XOR

2-bit combiner functions: there are 16 which satisfy all the requirements, if you consider the order of the 1-bit pairs important. If you don't, there are fewer.
First are the ones based on the 1-bit combiner functions.

  • XOR, XOR
  • XOR, NXOR – this can be reversed by changing order of bits.
  • NXOR, NXOR

The rest are new.

  • 2-bit unsigned addition ignoring overflow – upside down addition can be performed by changing the order of bits.
  • 2-bit unsigned addition ignoring overflow, + 1, 2 or 3 – the offset isn't actually very useful, but it does change the output – upside down addition can be performed by changing the order of bits.

I don't have a name for the final type of operator – changing the order of bits doesn't do anything because it's symmetrical. It can be xor'd with 1 or 2 or 3. The basic gist, however, is that if one of the arguments is 0 or 3 it behaves like XOR; otherwise it behaves like equality: it returns 3 if the inputs are the same, 0 if the inputs are different. If the inputs are a and b, and aRot is input a with its bits switched and bRot is input b with its bits switched, then logically the formula is as follows. I am hoping this can be reduced, but right now, that's beyond my puny little brain.

((~(a ^ aRot) | ~(b ^ bRot)) & (a ^ b)) | (((a ^ aRot) & (b ^ bRot)) & (a ^ bRot | b ^ aRot)) – the final b ^ aRot term is unneeded, as it has the same result as a ^ bRot in that context, but it's included to make the symmetry obvious.

3-bit combinators I haven't even started on, but I'm sure there is a wealth of complex forms to be found beyond all the possible combinations of 1- and 2-bit combinators.
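
As a sanity check on the requirements above, here is a small brute-force sketch (my own illustration, not library code) which tests a candidate 2-bit combiner for symmetry, associativity and non-bias, and confirms that XOR and 2-bit addition-ignoring-overflow both pass:

using System;

static class CombinerCheckSketch
{
    // Tests f over the 2-bit domain {0,1,2,3} for the three requirements discussed above.
    static bool IsValidUnorderedCombiner(Func<int, int, int> f)
    {
        int[] outputCounts = new int[4];
        for (int a = 0; a < 4; a++)
        {
            for (int b = 0; b < 4; b++)
            {
                if (f(a, b) != f(b, a)) return false;                  // symmetry
                for (int c = 0; c < 4; c++)
                {
                    if (f(f(a, b), c) != f(a, f(b, c))) return false;  // associativity
                }
                outputCounts[f(a, b) & 3]++;
            }
        }
        foreach (int count in outputCounts)
        {
            if (count != 4) return false;  // non-bias: each output occurs 16/4 = 4 times
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsValidUnorderedCombiner((a, b) => a ^ b));        // True: XOR
        Console.WriteLine(IsValidUnorderedCombiner((a, b) => (a + b) & 3));  // True: 2-bit addition ignoring overflow
        Console.WriteLine(IsValidUnorderedCombiner((a, b) => a & b));        // False: AND is biased
    }
}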