Stefan Pohl Computer Chess

private website for chessengine-tests


Stockfish Regression Testing with long thinking time 

(10min+3sec = average game duration 30 minutes!!!)

 

Latest testrun: Stockfish 220806
The reference point (opponent) is the latest official SF release (currently Stockfish 15).
Each SF dev version plays 2000 games versus this engine.

 

Hardware: AMD Ryzen 3900 12-core (24 threads) notebook with 32GB RAM. 20 games are played simultaneously

Speed: Singlethread, TurboBoost-mode switched off, chess starting position: Stockfish 15: 750000 n/s

Hash: 512MB per engine

GUI: cutechess-cli (the GUI ends the game when a 5-piece endgame is on the board)

Tablebases: None for engines, 5-piece Syzygy for cutechess-cli

Openings: Because high draw rates had made it impossible to measure Elo progress in regression tests, my UHO_2022_6mvs_+120_+129 openings are used here (part of my UHO 2022 download).

Ponder, Large Memory Pages & learning: Off

Thinking time: 10min+3sec per game/engine (average game duration: around 30 minutes). One 2000-game testrun takes about 47 hours.

The version numbers of the Stockfish engines are the date of the latest patch included in the Stockfish source code (not the release date of the engine file), written backwards as year, month, day (example: 200807 = August 7, 2020). The SF compile used is the AVX2 compile, which is the fastest on my AMD Ryzen CPU. SF binaries are taken from abrok.eu (except the official SF release versions, which are taken from the official Stockfish website).
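The version-number convention can be decoded mechanically. Here is a minimal sketch in Python, assuming the number is always a six-digit YYMMDD string; the function name `parse_sf_version` is only an illustration and not part of any tool used for these tests.

```python
from datetime import date, datetime


def parse_sf_version(version: str) -> date:
    """Convert a YYMMDD version number (e.g. '220806') into a calendar date."""
    # %y reads a two-digit year; values 00-68 are mapped to 2000-2068.
    return datetime.strptime(version, "%y%m%d").date()


print(parse_sf_version("200807"))  # 2020-08-07
print(parse_sf_version("220806"))  # 2022-08-06
```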

 

ORDO calculation fixed to reference-engine (Elo = 0)

You can download all played games from my Google-Drive. Download here

 

Here is the progress in regression testing since Stockfish 15 (2022/04/18), with the Elo of SF 15 set to 0:

     Program                    Elo    +    -  Games    Score   Av.Op. Draws

   1 Stockfish 220713 avx2    :   19   11   11  2000    52.7%      0   48.4%
   2 Stockfish 220704 avx2    :   19   10   10  2000    52.7%      0   49.5%
   3 Stockfish 220806 avx2    :   17   11   11  2000    52.4%      0   49.8%
   4 Stockfish 220724 avx2    :   16   11   11  2000    52.3%      0   51.1%
   5 Stockfish 220709 avx2    :   16   12   12  2000    52.3%      0   49.4%
   6 Stockfish 220515 avx2    :   14   11   11  2000    52.0%      0   51.0%
   7 Stockfish 220602 avx2    :   13   11   11  2000    51.9%      0   51.8%
   8 Stockfish 220607 avx2    :   12   11   11  2000    51.8%      0   49.5%
   9 Stockfish 220529 avx2    :   12   10   10  2000    51.7%      0   53.0%
  10 Stockfish 220620 avx2    :    8   11   11  2000    51.1%      0   50.0%
  11 Stockfish 15 220418      :    0    2    2 36000    54.8%    -36   49.6%
  12 Stockfish 220422 avx2    :   -3   12   12  2000    49.5%      0   51.5%
  13 Stockfish 220504 avx2    :   -7   11   11  2000    49.0%      0   49.3%
  14 Stockfish 14.1 211028    :  -53   11   11  2000    42.5%      0   49.8%
  15 Stockfish 14 210702      :  -98   10   10  2000    36.4%      0   52.5%
  16 KomodoDragon 3 avx2      : -108   11   11  2000    35.1%      0   49.0%
  17 Stockfish 13 210218      : -134   11   11  2000    31.8%      0   49.5%
  18 Stockfish 12 200902      : -169   11   11  2000    27.6%      0   49.3%
  19 Stockfish final HCE      : -231   13   13  2000    21.1%      0   38.9%


Games        : 36000 (finished)

White Wins   : 17824 (49.5 %)
Black Wins   : 309 (0.9 %)
Draws        : 17867 (49.6 %)
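
The Elo column in the table above can be roughly reproduced from the Score column with the standard logistic Elo formula. This is only a sketch, assuming each engine played the reference engine alone (as in this test); ORDO's own calculation is more involved and may differ by a point or two. The function name `elo_from_score` is illustrative.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Scores taken from the table above (opponent: Stockfish 15, fixed at Elo 0).
for name, score in [("Stockfish 220713", 0.527), ("Stockfish 14.1", 0.425)]:
    print(f"{name}: {elo_from_score(score):+.0f} Elo")
```

For 52.7% this gives about +19 Elo and for 42.5% about -53 Elo, in line with the table.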

Stockfish final HCE (date 2020/07/31) was the last SF dev version before the NNUE neural nets were introduced. So this engine is (and perhaps will stay forever?) the strongest HCE (Hand Crafted Eval) engine on the planet, apart from newer Stockfish engines with the NNUE eval switched off.


Below is the regression gamebase, recalculated with my Gamepairs Rescorer Batch-Tool. It realizes the idea of Vondele (Stockfish maintainer): "Thinking uniquely in game pairs makes sense with the biased openings used these days. While pentanomial makes sense it is a bit complicated so we could simplify and score game pairs only (not games) as W-L-D (a traditional  score of 2-0, or 1.5-0.5 is just a W)."

 

   # PLAYER                   :  RATING  ERROR  PLAYED     W    D    L  Score

   1 Stockfish 220704 avx2    :      38     14    1000   286  537  177  55.5%
   2 Stockfish 220713 avx2    :      38     14    1000   273  562  165  55.4%
   3 Stockfish 220724 avx2    :      33     15    1000   277  539  184  54.6%
   4 Stockfish 220806 avx2    :      33     15    1000   272  551  177  54.8%
   5 Stockfish 220709 avx2    :      32     15    1000   273  544  183  54.5%
   6 Stockfish 220515 avx2    :      29     15    1000   259  564  177  54.1%
   7 Stockfish 220602 avx2    :      26     15    1000   258  558  184  53.7%
   8 Stockfish 220607 avx2    :      24     15    1000   249  571  180  53.5%
   9 Stockfish 220529 avx2    :      24     15    1000   262  543  195  53.4%
  10 Stockfish 220620 avx2    :      15     15    1000   242  559  199  52.1%
  11 Stockfish 15 220418      :       0
  12 Stockfish 220422 avx2    :      -7     15    1000   208  565  227  49.0%
  13 Stockfish 220504 avx2    :     -14     14    1000   207  545  248  48.0%
  14 Stockfish 14.1 211028    :    -109     16    1000   108  482  410  34.9%
  15 Stockfish 14 210702      :    -211     18    1000    33  396  571  23.1%
  16 KomodoDragon 3 avx2      :    -239     18    1000    27  354  619  20.4%
  17 Stockfish 13 210218      :    -307     20    1000    12  271  717  14.8%
  18 Stockfish 12 200902      :    -466     29    1000     2  127  871   6.5%
  19 Stockfish final HCE      :    -669     54    1000     0   43  957   2.1%
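
The game-pair scoring described in Vondele's quote can be sketched in a few lines of Python. This is not the actual Gamepairs Rescorer Tool. It assumes (hypothetically) that results are listed from one engine's point of view as 1, 0.5 or 0, and that every two consecutive games form a pair played from the same opening with colors reversed. The names `score_pair` and `rescore` are illustrative.

```python
from collections import Counter


def score_pair(game1: float, game2: float) -> str:
    """Classify one game pair: more than 1 point is a W, less is an L, exactly 1 is a D."""
    total = game1 + game2
    if total > 1.0:
        return "W"  # 2-0 or 1.5-0.5
    if total < 1.0:
        return "L"  # 0-2 or 0.5-1.5
    return "D"      # 1-1


def rescore(results: list[float]) -> Counter:
    """Group consecutive games into pairs (assumed pairing) and tally W/D/L."""
    if len(results) % 2:
        raise ValueError("need an even number of games")
    return Counter(score_pair(a, b) for a, b in zip(results[0::2], results[1::2]))


# Ten games = five pairs: (1,1) (1,0.5) (1,0) (0.5,0.5) (0,0.5)
games = [1, 1, 1, 0.5, 1, 0, 0.5, 0.5, 0, 0.5]
tally = rescore(games)
print(tally["W"], tally["D"], tally["L"])
```

With 2000 games per engine this gives the 1000 pairs shown in the PLAYED column, and the Score column corresponds to (W + D/2) / PLAYED, e.g. (286 + 537/2) / 1000, about 55.5%, for Stockfish 220704.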

 

You can download my Gamepairs Rescorer Tool right here