Stefan Pohl Computer Chess

Private website for chess engine tests


Stockfish Regression Testing

 

 

Latest test run: Stockfish 211129
The reference point (opponent) is the latest official SF release (currently Stockfish 14.1).
Each SF dev version plays 20000 games against this engine.

 

Hardware: AMD Ryzen 3900 12-core (24 threads) notebook with 32GB RAM; 20 games are played simultaneously.

Speed: single thread, TurboBoost switched off, measured from the chess starting position: Stockfish 11: 1.3 million nodes/s, Komodo 14: 1.1 million nodes/s.

Hash: 128MB per engine

GUI: cutechess-cli (the GUI ends the game as soon as a 5-piece endgame is on the board).

Tablebases: none for the engines; 5-piece Syzygy for cutechess-cli (endgame adjudication).

Openings: Because high draw rates have meanwhile made it impossible to measure Elo progress in regression tests, my UHO_V3_8mvs_big_+090_+119 openings are used here (part of my Anti Draw openings download). These openings have the same KomodoDragon 1.0 eval interval [+0.90;+1.19] as the UHO_XXL openings, which are used in the Stockfish framework for development. But the UHO V3 openings were evaluated with around 50x (!!!) more nodes per end position than the UHO_XXL openings, because the UHO V3 files are much smaller.

Ponder, Large Memory Pages & learning: Off

Thinking time: 30 sec + 300 ms per game/engine (average game duration: around 1.5 minutes). One 20000-game test run takes about 30 hours.

The version numbers of the Stockfish engines are the date of the latest patch included in the Stockfish source code (not the release date of the engine file), written backwards (year, month, day); example: 200807 = August 7, 2020.

The SF compile used is the AVX2 compile, which is the fastest on my AMD Ryzen CPU. SF binaries are taken from abrok.eu (except the official SF release versions, which are taken from the official Stockfish website).
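For illustration, a cutechess-cli call matching these settings could look roughly like the sketch below. This is only a hedged example, not my actual batch file: the engine binaries, the opening-file name and the tablebase path are placeholders.

    cutechess-cli ^
      -engine cmd=stockfish_dev_avx2.exe name="Stockfish 211129 avx2" ^
      -engine cmd=stockfish_14.1_avx2.exe name="Stockfish 14.1 211028" ^
      -each proto=uci tc=30+0.3 option.Hash=128 ^
      -openings file=UHO_V3_8mvs_big_+090_+119.pgn format=pgn order=sequential ^
      -repeat -games 2 -rounds 10000 -concurrency 20 ^
      -tb C:\syzygy -tbpieces 5 ^
      -pgnout regression_211129.pgn

Here -games 2 -rounds 10000 -repeat plays each opening as a reversed-color pair (20000 games total), and -tb/-tbpieces lets the GUI adjudicate 5-piece endgames with Syzygy, as described above.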

 

ORDO calculation anchored to the reference engine (Elo = 0)
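A hedged sketch of such an Ordo call (file names are placeholders; -a/-A fix the reference engine at 0 Elo, -s runs simulations to estimate the error margins):

    ordo -p regression_games.pgn -a 0 -A "Stockfish 14.1 211028" -W -s 1000 -o ratings.txt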

You can download all played games from my Google Drive. Note: I deleted all comments in the PGN files (eval, search depth etc.), because the files would be too big otherwise. Download here

 

Here is the progress in regression testing since Stockfish 14.1 (2021/10/28), with the Elo of SF 14.1 set to 0:

     Program                      Elo      +      -   Games   Score   Av.Op.  Draws

   1 Stockfish 211129 avx2    :   23.0    3.3    3.3 20000    53.3 %      0   52.9 %
   2 Stockfish 211127 avx2    :   19.8    3.2    3.2 20000    52.8 %      0   53.2 %
   3 Stockfish 211111 avx2    :    9.0    3.2    3.2 20000    51.3 %      0   54.1 %
   4 Stockfish 211105 avx2    :    8.7    3.5    3.5 20000    51.2 %      0   54.8 %
   5 Stockfish 211121 avx2    :    8.0    3.4    3.4 20000    51.1 %      0   53.9 %
   6 Stockfish 211101 avx2    :    4.1    3.4    3.4 20000    50.6 %      0   54.6 %
   7 Stockfish 14.1 211028    :    0.0    0.9    0.9 220000   56.2 %    -48   50.1 %
   8 Stockfish 210822 avx2    :  -17.5    3.4    3.4 20000    47.5 %      0   54.0 %
   9 Stockfish 14 210702      :  -53.4    3.6    3.6 20000    42.4 %      0   51.4 %
  10 Stockfish 13 210218      : -109.3    3.4    3.4 20000    34.9 %      0   46.9 %
  11 Stockfish 12 200902      : -158.5    3.6    3.6 20000    28.8 %      0   44.8 %
  12 Stockfish final HCE      : -263.4    4.4    4.4 20000    18.2 %      0   30.5 %


Games        : 220000 (finished)

White Wins   : 102620 (46.6 %)
Black Wins   : 7178 (3.3 %)
Draws        : 110202 (50.1 %)
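As a sanity check (my own illustration, not part of the Ordo output): with all opponents anchored at 0 Elo, a score percentage maps to an Elo difference via the usual logistic model. For Stockfish 211129:

    from math import log10

    score = 0.533                      # 53.3 % score vs. the 0-Elo anchor
    elo = -400 * log10(1 / score - 1)  # logistic Elo model
    print(round(elo, 1))               # 23.0, matching the table above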

Stockfish 210822 was the last dev version before the Stockfish framework started using my Unbalanced Human Openings (UHO) for further Stockfish development (before that, classical balanced openings were used). Because switching to UHO openings for development was an important change, IMHO it makes sense to measure progress from this point, and for that the performance of Stockfish 210822 is needed.

Stockfish final HCE (date 200731) was the latest SF dev version before the NNUE neural nets were introduced. So this engine is (and perhaps will stay forever?) the strongest HCE (Hand Crafted Eval) engine on the planet. IMHO this makes it very interesting for comparison.


Below, the regression gamebase is recalculated with my Gamepairs Rescorer batch tool, realizing an idea of Vondele (Stockfish maintainer): "Thinking uniquely in game pairs makes sense with the biased openings used these days. While pentanomial makes sense it is a bit complicated so we could simplify and score game pairs only (not games) as W-L-D (a traditional score of 2-0, or 1.5-0.5 is just a W)."
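A minimal sketch of that pair-rescoring logic (my illustration of the idea, not the actual tool; it assumes both games of a pair were played with reversed colors and that each single-game result is 1, 0.5 or 0 from the test engine's point of view):

    # Rescore one game pair as a single W/L/D result for the test engine:
    # 2-0 and 1.5-0.5 count as a win, 1-1 as a draw, the rest as a loss.
    def rescore_pair(game1: float, game2: float) -> str:
        pair_score = game1 + game2
        if pair_score > 1.0:
            return "W"
        if pair_score < 1.0:
            return "L"
        return "D"

    # Example: a win with White and a draw with Black count as one "W".
    print(rescore_pair(1.0, 0.5))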

 

   # PLAYER                   :  RATING  ERROR  PLAYED      W      D      L   (%)
   1 Stockfish 211129 avx2    :      46      5   10000   3047   5216   1737  56.5
   2 Stockfish 211127 avx2    :      40      5   10000   2969   5191   1840  55.6
   3 Stockfish 211111 avx2    :      17      5   10000   2575   5346   2079  52.5
   4 Stockfish 211105 avx2    :      17      5   10000   2554   5372   2074  52.4
   5 Stockfish 211121 avx2    :      16      5   10000   2592   5276   2132  52.3
   6 Stockfish 211101 avx2    :       7      5   10000   2400   5411   2189  51.1
   7 Stockfish 14.1 211028    :       0
   8 Stockfish 210822 avx2    :     -34      5   10000   1830   5359   2811  45.1
   9 Stockfish 14 210702      :    -103      5   10000   1208   4729   4063  35.7
  10 Stockfish 13 210218      :    -216      6   10000    503   3502   5995  22.5
  11 Stockfish 12 200902      :    -343      7   10000    177   2125   7698  12.4
  12 Stockfish final HCE      :    -542     11   10000     24    822   9154   4.3

 

You can download my Gamepairs Rescorer Tool right here