How To

Here are step-by-step instructions for turning a single-threaded chess engine into a high-performance multithreaded engine.

Step 1: Put your chess engine in a C++ class so you can easily instantiate multiple engines and run them in different threads.

Set up your engine so that when you do a search, it runs exactly the same search on multiple threads. This should be like running multiple independent copies of your single-threaded chess engine.

Make sure the output from each engine/thread is exactly the same: node counts, etc. (This is a good way to make sure your engines aren't sharing memory unintentionally. Sharing memory can ruin performance.)

Step 2: Figure out how to turn off Turbo Boost.

Your processor probably runs at different clock speeds depending on how many cores are under load. This makes it almost impossible to calculate useful speedup numbers. Find a utility (or a command-line incantation) that allows you to easily enable and disable your processor's Turbo Boost feature.
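On Linux, for example, the `intel_pstate` driver exposes a sysfs knob for this (the exact path depends on your CPU and driver; on other systems you may need a vendor utility or a BIOS setting):

```shell
# Disable Turbo Boost (intel_pstate driver); write 0 to re-enable.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```

With Turbo Boost off, a 1-thread run and a 16-thread run execute at the same clock speed, so their NPS numbers are directly comparable.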

Step 3: Reduce your engine's memory traffic until you get near-ideal speedups in terms of NPS (nodes per second).

Given the size of a modern processor's caches, your engine should be able to search at least 15.5x as many NPS with 16 cores. If your engine doesn't scale that well, here are some ideas for reducing memory traffic:
  • Don't probe the transposition table in your quiescence search.
  • Reduce the size of your pawn hash table.

Step 4: Make a test suite to measure parallel search speedups.

Find ~10 positions that satisfy these criteria:
  • It takes your engine more than 1 minute to find the solution.
  • Your engine finds the solution and finishes the ply in less than 2 minutes.
A good way to find positions like this is to run a large test suite (like Encyclopedia of Chess Middlegames) for 2 minutes per position and look at the output log.

Make a test suite with these positions, and include the following metadata for each position:
  • How many ply it takes to find the solution
  • How long it takes to find the solution
  • How long it takes to finish the ply
  • How many NPS it searches
(If your test suite file format doesn't support metadata like this, extend the format!)

Modify your test suite code so that for each position, it only searches the number of ply necessary to find the solution, and it prints out speedup numbers for time-to-solution, time-to-ply, and NPS. It should also print out average speedup numbers at the end of the test.

Note that parallel searches are non-deterministic and you should run your test suite several times before drawing any conclusions from the speedup numbers.

Step 5: Switch to a shared hash table (transposition table).

Modify your engine so all your threads use a shared hash table.

Because it's a shared resource, you'll have to synchronize access to it with a lock (or multiple locks):
  • You could use one lock for the entire table, but this is a bad idea. Threads would have to wait on each other even if they're accessing different entries. And all the threads would be accessing one location in memory (the lock) which translates to enormous cache coherence overhead.
  • You can use an array of locks, with each lock corresponding to a different set of entries. (Use the low-order bits of the hash key to select which lock to use.) Make sure you have enough locks to avoid false sharing.
  • For optimal scaling, add a spinlock (4 bytes) to each hash table entry.
Switching to a shared hash table will result in significant speedups because your threads won't be perfectly synchronized: one thread might finish a position before another thread starts it.
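The lock-array option can be sketched like this. The entry layout, table size, and lock count are assumptions for illustration; the key point is that both table sizes are powers of two, so the low-order bits of the hash key pick both the entry and its lock.

```cpp
#include <cstdint>
#include <mutex>

constexpr size_t NUM_ENTRIES = 1 << 20;  // power of two
constexpr size_t NUM_LOCKS   = 1 << 10;  // enough locks to spread contention

struct HashEntry {
    uint64_t key = 0;
    int depth = 0, score = 0;
};

HashEntry table[NUM_ENTRIES];
std::mutex locks[NUM_LOCKS];

// Low-order bits of the key select the lock; because NUM_LOCKS divides
// NUM_ENTRIES, entries that share a lock never race with each other.
void store(uint64_t key, int depth, int score) {
    std::lock_guard<std::mutex> guard(locks[key & (NUM_LOCKS - 1)]);
    HashEntry& e = table[key & (NUM_ENTRIES - 1)];
    e.key = key; e.depth = depth; e.score = score;
}

bool probe(uint64_t key, int& depth, int& score) {
    std::lock_guard<std::mutex> guard(locks[key & (NUM_LOCKS - 1)]);
    const HashEntry& e = table[key & (NUM_ENTRIES - 1)];
    if (e.key != key) return false;
    depth = e.depth; score = e.score;
    return true;
}
```

For the per-entry spinlock variant, the same `store`/`probe` structure applies, but the 4-byte lock lives inside `HashEntry` itself, so the lock and the data it guards share a cache line.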

Expected average speedup: ~6x with 16 cores.

Step 6: Implement Simplified ABDADA and Cutoff Checks.

See these pages:

Simplified ABDADA
Cutoff Checks

Expected average speedup: ~9x with 16 cores.

Step 7: Synchronize the root search parameters.

Thread synchronization with ABDADA is good but not always perfect. If one of your threads finishes searching a ply (for example), you don't want the other threads to keep searching that ply.

Make a shared data structure to store the most current root search parameters:
  • Alpha
  • Beta
  • Depth
  • Best move
  • Move that failed a null-window search (if any)
Modify your search so it regularly checks to make sure it's using the latest parameters. If it isn't, it should stop and restart. If a move fails a null-window search, it's especially important that all the threads start searching that move as soon as possible.
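One way to sketch this shared structure is with a version counter, so each thread can cheaply check whether the parameters it started with are stale. All names here are assumptions; a real implementation would also store the best move and the move that failed a null-window search.

```cpp
#include <mutex>

struct RootParams {
    int alpha = -32000, beta = 32000, depth = 0;
    int version = 0;  // bumped on every update
    std::mutex mtx;

    // A thread that improves the root publishes the new parameters.
    void update(int a, int b, int d) {
        std::lock_guard<std::mutex> g(mtx);
        alpha = a; beta = b; depth = d;
        version++;
    }

    // Searching threads poll this periodically; if the version moved,
    // they should stop, reload the parameters, and restart.
    bool is_stale(int seen_version) {
        std::lock_guard<std::mutex> g(mtx);
        return version != seen_version;
    }
};
```

How often to poll is a trade-off: checking at every node adds lock traffic, while checking too rarely wastes work on stale parameters. Every few thousand nodes is a reasonable starting point.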

Expected average speedup: ~10x with 16 cores.

See this page for exact speedup numbers.