Tom Kerrigan's Home Page
Here are step-by-step instructions for turning a single-threaded chess engine into a high-performance multithreaded engine.
Step 1: Put your chess engine in a C++ class so you can easily instantiate multiple engines and run them in different threads.
Set up your engine so that when you do a search, it runs the exact same search on multiple threads. This should be like running multiple independent copies of your single-threaded chess engine.
Make sure the output from each engine/thread is exactly the same: node counts, etc. (This is a good way to make sure your engines aren't sharing memory unintentionally. Sharing memory can ruin performance.)
Step 2: Figure out how to turn off Turbo Boost.
Your processor probably runs at different clock speeds depending on how many cores are under load. This makes it almost impossible to calculate useful speedup numbers. Find a utility (or a command-line incantation) that allows you to easily enable and disable your processor's Turbo Boost feature.
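On Linux with the intel_pstate driver, for example, turbo can be toggled through sysfs (this is an assumption about your platform; other operating systems and drivers have their own switches):

```shell
# Disable Turbo Boost (Linux, intel_pstate driver):
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Re-enable it when you're done benchmarking:
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```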
Step 3: Reduce your engine's memory traffic until you get near-ideal speedups in terms of NPS (nodes per second).
Given the size of a modern processor's caches, your engine should be able to search at least 15.5x as many NPS with 16 cores. If your engine doesn't scale that well, keep looking for ways to reduce its memory traffic.
Step 4: Make a test suite to measure parallel search speedups.
Find ~10 suitable test positions (positions with a known solution move that a search finds at a predictable ply).
Make a test suite with these positions, recording for each one its solution (best move) and the ply at which your single-threaded engine finds it.
Modify your test suite so that, for each position, it searches only the number of ply needed to find the solution, and have it print speedup numbers for time-to-solution, time-to-ply, and NPS. It should also print average speedup numbers at the end of the test.
Note that parallel searches are non-deterministic, so run your test suite several times before drawing any conclusions from the speedup numbers.
Step 5: Switch to a shared hash table (transposition table).
Modify your engine so all your threads use a shared hash table.
Because it's a shared resource, you'll have to synchronize access to it with a lock (or multiple locks).
Expected average speedup: ~6x with 16 cores.
Step 6: Implement Simplified ABDADA and Cutoff Checks.
See the Simplified ABDADA and Cutoff Checks pages for details.
Expected average speedup: ~9x with 16 cores.
Step 7: Synchronize the root search parameters.
Thread synchronization with ABDADA is good but not always perfect. If one of your threads finishes searching a ply (for example), you don't want the other threads to keep searching that ply.
Make a shared data structure that stores the most current root search parameters (e.g. the ply currently being searched and the best move and score so far), and have each thread consult it as it searches.
Expected average speedup: ~10x with 16 cores.
See this page for exact speedup numbers.