Ceres

Ceres is a newly released chess engine which has a re-implementation of MCTS search over the lc0 backends. It shows very strong search performance characteristics which will be interesting to learn from. The specific details of some of those performance characteristics would likely be difficult to copy in to lc0, without also requiring a full rewrite. So there has been some talk of possibly migrating to Ceres as a a replacement for the lc0 engine in general. This post isn’t about that though. Instead I wanted a place to capture the ‘details’ of difference as I read through the Ceres codebase.

Move selection

Ceres has a couple of novelties in move selection.

  1. It has an option to take ‘best Q’ in preference to ‘best N’ so long as the visit count is still a decent fraction of best N. Specifically this fraction depends on the search tree size and the delta in Q. The values look empirical and are step wise. Maybe a curve fit of some kind could make this a more practically tune-able feature. Notably even for the largest Q deltas in a large tree the N required for a node to be selected is 30% of best N.
  2. Due to supporting a ‘best Q’ based approach, it has a function for adding a ‘boost’ to the Q value used based on the moves left head output. Curiously its based on the moves left estimate of the ‘best Q’ without filtering for N, rather than the moves left estimate of the ‘best N’ which should be more accurate.

Search

By far the largest section of novelties is going to be the search itself… I’m not covering the micro-optimizations here, of which there are plenty. (Hand rolled AVX2 parallelization of the UCT select inner loop for instance.)

  1. Try to gather x nodes per tree walk. Where lc0 performs a separate tree walk for each node to be gathered and performs out of order updates after each such node, Ceres gathers multiple nodes in a single tree walk. This is something I’ve previously considered but was concerned that the interaction with out of order would cause too much of a problem. The basic performance benefit of not walking the same nodes multiple times to gather the batch has already been partially offset in lc0 via some smart local caching, which Ceres doesn’t need as a result. Ceres splits the gathered set and applies the out of orders as a bunch. This risks a problem that lc0 has had in the past where we allowed multiple hits on a terminal to apply at once without considering the actual back prop of those terminals. We disabled multiple visits because the speed in rare circumstances didn’t offset the risk. With Ceres however it is core to gathering speed in all scenarios, so it is likely that the overall speed improvement can offset the risk. Additionally Ceres does some other things that probably helps here… but maybe more importantly this risk is linked to batch size, not collision limits like we used to have it set up in lc0.
  2. Adaptive parallel search, if the search tree has a ‘nice’ fork point – use a second thread for the other side of the fork. This is only doable because of having a target of x nodes to gather at once. These independent parallel searchers could effectively share a single lock since they don’t work on any shared data, except for transposition handling getting in the way (?Apparently?)
  3. Adaptive batch sizes. When the tree is small Ceres tries to gather batches far smaller than optimal for gpu utilization. This probably offsets some of the risk of getting multiple hits to a terminal without back prop, which is especially dangerous when the size of the batch is a noticeable as a proportion of the total tree.
  4. Even with adaptive batch sizes, split them smaller for the parallel gather and use heuristics to decide whether to do the rest. This seems like further offsetting the out of order risk, but keeping the potential to gain search performance from the adaptive parallel search.
  5. Two gatherers that use semi-independent n in flight calculations (adding in a fraction of the others), with separate cpuct values to reduce the probability that they gather nodes that are similar. I’m unclear why this can really work in practice (although the fractional addition of the other helps), but assuming it does it is pretty nice.
  6. Fractional relative virtual loss. This looks like a critical component at first glance. One of the issues facing trying to gather ‘x nodes at once’ in lc0’s model is that we don’t use virtual loss – instead we use ‘virtual visit’, leaves the nodes current q value in place and just increments N. True virtual loss causes selection of really bad nodes too easily, but virtual visits can require quite a large number of visits in order to change selections. Thus lc0 often has to calculate a ‘multivisit’ collision in order to avoid just colliding on the same thing over and over. Ceres uses an alternative in between, using a parent relative virtual loss that is ‘small’ as opposed to the absolute virtual loss (unless the position is absolutely losing already, in which case its worse). This makes it much easier to gather batches, but the batches still aren’t as bad as pure virtual loss.
  7. Transposition handling. Seems like work in progress still, but the default mode does at least return the transposition node Q rather than transposition node V, which is a slight step up over simple caching.
  8. Adaptive batch evaluation. For small search trees it will not let there be any pending evaluations before starting next batch gathering. For lc0 this would be like starting one search thread and then starting another once the tree was big enough. Further offsets the risk during the small tree.
  9. Use of MLH in search is currently disabled – curious what led to that decision…

Backend Handling

Where lc0 uses 2 search threads which block on the backend, to try and keep a single backend busy. Ceres instead allocates 2 separate backend instances, potentially asking the gpu to do more than one piece of work at once. I wouldn’t immediately expect this to be a performance gain, but if it is it should be easy to simulate in lc0 via creating a roundrobin backend with the same gpu listed twice.

And I think that about covers it for an initial pass through the core search components of the code base. It is a large amount of new code so I am sure I’ve missed several things. There are a lot of features I’ve not mentioned, that are currently disabled.

A new post

So, its been a long time since I wrote anything on this blog obviously… I’ve been distracted by some new hobbies. Specifically I’ve been helping to run the training for the leela chess zero project. www.lczero.org

I still do some competitive coding, but not in any public external competitions. So I probably won’t be writing up any summaries any time soon. But maybe I’ll write up some stuff on some other topics…