Stefan Pohl Computer Chess
Home of famous UHO openings and EAS Ratinglist
Here you find experimental testruns, which are not part of my regular testwork.
2024/02/15 Experimental testrun of Revenge 1.0 for my UHO-Top15 ratinglist, in order to test, if my EAS-tool works as I predicted.
Author of Willow 4.0 engine on talkchess said this about my EAS-tool:
So, here is the proof that this is completely wrong and that my EAS-tool works as I always predicted: I did a testrun of Revenge 1.0 (the strongest really aggressively playing engine besides Stockfish and Torch, but light-years weaker than Stockfish and Torch, of course):
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 16 230630 : 3821 4 4 15000 73.8% 3628 45.8%
White Wins : 57796 (48.2 %)
bad avg.win
A: Most high-value sacrifices (3+ pawnunits):
[1]:05.38% Revenge 1.0 avx2
[2]:03.61% Stockfish 16 230630
[3]:02.31% Rebel EAS avx2
[4]:02.25% Torch 1 popavx2
[5]:01.78% Obsidian 10.0 avx2
[2]:20.06% Stockfish 16 230630
[3]:15.98% Rebel EAS avx2
[4]:15.17% Torch 1 popavx2
[5]:15.14% KomodoDragon 3.3 avx2
[2]:02.85% Stockfish 16 230630
[3]:01.95% Torch 1 popavx2
[4]:01.87% KomodoDragon 3.3 avx2
[5]:01.15% Rebel EAS avx2
[2]:27.19% Torch 1 popavx2
[3]:23.61% Stockfish 16 230630
[4]:21.14% KomodoDragon 3.3 avx2
[5]:17.85% RubiChess 240112 avx2
[2]:071 Revenge 1.0 avx2
[3]:071 Stockfish 16 230630
[4]:072 KomodoDragon 3.3 avx2
[5]:074 RubiChess 240112 avx2
So, the clearly (very clearly!) weakest engine is on rank 1 in the EAS-ratinglist! How awesome is that?
2023/02/22 Experimental testrun of Rebel 16.2 with different values of the Evalcorrect UCI-parameter. This option can be used to change the playing style of the engine. The default value is 202. Increasing the value should increase the engine's aggressiveness. A 10000-game RoundRobin tournament was played: 60sec+600ms thinking-time, singlethread, no ponder, no bases, using my UHO_2022_8mvs_+120_+129 openings. Download the games of this test right here
Program Elo + - Games Score Av.Op. Draws
1 Rebel 16.2 default : 3617 6 6 4000 52.4% 3600 56.1%
Below is the Engines Aggressiveness Scoring (EAS), calculated with my EAS-Tool (V5.21):
bad avg.win
Conclusions: The Evalcorrect-parameter seems quite meaningless. As you can see, the strength and the aggressiveness of Rebel are nearly identical with all Evalcorrect-values, and a higher Evalcorrect-value seems to lower the aggressiveness instead of increasing it...
2022/10/14 Experimental testruns (3) of Pedone 3 with different Strength-parameter settings, merged into one pgn-file. 1min+1sec, singlethread, no ponder, no bases, balanced openings (Feobos c3). I tested Pedone 3 because it plays very aggressively and runs on Android smartphones, too. So it is a very interesting engine for humans to play against (on an electronic chessboard, etc.). Download Pedone 3 here (note: do not use Pedone 3.1, the successor, because it does not play very aggressively!) Download the 40500 played testgames and statistics here (as you can see in the ratinglist, the strength-parameter is a little bit strange... It has a range of 0 to 100, but several settings do not differ in strength...)
Program Elo + - Games Score Av.Op. Draws
1 Pedone 3.0 100 : 3350 75 75 2000 99.3% 2008 1.4%
White Wins : 18264 (45.1 %)
2022/07/03 Experimental testruns of 2 different TripleBrain "engines", using the aiquiri-engine. Download all 16000 played games and the aiquiri-engine folder here
First of all: Aiquiri does not run with all engines. For example, Koivisto 8.13 and Ethereal 13.75 did not work. So, if you use aiquiri, always check (with the Task Manager) that "master.exe", "slave1.exe" and "slave2.exe" are running! I used cutechess-cli for my testruns. There I increased the timemargin-parameter to 2000 and set restart=on for the aiquiri-engine (=reloading after each finished game). The time parameters in aiquiri were lowered to 35 for the slaves (default 40) and to 15 for the master (default 20). That worked. With my SPCC-testsetup (3min+1sec, singlethread, 20 games running simultaneously) I had only 8 timelosses in 16000 games - acceptable. I used my EAS-tool for measuring the aggressiveness of play of the engines (look here for more information) and the SPCC-Elo of the engines for the strength.
I built 2 different TripleBrains: (1) with a strong but solid playing master and two weaker but very aggressive playing slaves and (2) with a weaker but very aggressive master and two stronger but solid playing slaves. Note that the master-engine only chooses between the 2 slave-moves (if there are 2 different moves); the master-engine never plays a move of its own!
Here are the results:
TripleBrain (1)
TripleBrain (2)
The results were (as I expected):
(1) The stronger master increases the Elo of the TripleBrain (compared to the slaves), but the aggressiveness fades away (the EAS-score is only around 50% of the EAS-scores of the slaves).
(2) The weaker master decreases the Elo of the TripleBrain, and the aggressive playing master was not able to gain aggressiveness out of the solid-playing slaves (because you cannot play aggressively if you can only choose between 2 solid moves!).
So, IMO, these 2 experiments show clearly that the TripleBrain-idea is useless... A solid, strong master makes aggressive slaves play stronger, but their aggressiveness fades away. An aggressive playing master is unable to make 2 stronger, solid playing slaves play more aggressively.
2022/01/20 Experimental testrun of Fat Titz 2 vs. Stockfish 220113 with very long thinking-time. Goal: Find out whether Fat Titz 2 can benefit from its bigger nnue-net (compared to Stockfish) when the thinking-time is very long... Thinking-time: 20min+10sec on singlethread, average game-duration 65 minutes!
Extreme longtime testrun of Fat Titz 2 vs. Stockfish 220113
Program Elo + - Games Score Av.Op. Draws
1 Fat Titz 2 bmi2 : 3802 7 7 1000 50.3 % 3800 51.7 %
Games : 1000 (finished)
Individual statistics: Fat Titz 2 bmi2: 1000 (+244,=517,-239), 50.3 %
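The score percentage in such a statistics line follows directly from the win/draw/loss counts. As a minimal sketch (the function names are hypothetical, and the logistic Elo formula is a generic approximation, not necessarily what the rating tool used for the list above computes), the numbers can be checked like this:

```python
import math


def score_fraction(wins: int, draws: int, losses: int) -> float:
    """Score as a fraction: a win counts 1 point, a draw half a point."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def elo_difference(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1),
    using the standard logistic Elo curve (an assumption here)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Counts taken from the statistics line: +244, =517, -239
s = score_fraction(244, 517, 239)
print(f"score = {100 * s:.2f} %")
print(f"implied Elo difference = {elo_difference(s):+.1f}")
```

With these counts the score comes out at 50.25 %, which the list shows rounded as 50.3 %, and the implied difference is only a couple of Elo, well inside the errorbar.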
Gamepairs rescoring tool result: # PLAYER : RATING ERROR PLAYED W D L (%)
Individual statistics: Fat Titz 2 bmi2: 500 (+117,=272,-111), 50.6 %
Conclusion: Fat Titz 2 does not benefit from its much bigger nnue-net (compared to Stockfish), even though the thinking-time was so long... Download games and statistics here
2021/11/24 Experimental RoundRobin tournament with 6 different playing-styles of KomodoDragon 2.5 (Default, Defensive, Positional, Human, Active, Aggressive), each style combined with MCTS on and off = 12 engine-settings. Tournament with 2'+1'' thinking-time on an AMD Ryzen 3900 12-core (24 threads) notebook with 32GB RAM, singlethread-mode, no ponder, no bases (except for cutechess-cli, to end a game), cutechess-cli, classical, balanced 8 moves deep openings, played by humans (out of Megabase 2020, both players 2400 Elo or better). 100 rounds = 13200 games played. Download the played games here
Program Elo + - Games Score Av.Op. Draws
1 Dragon 2.5 Default : 0 16 16 2200 89.1 % -409 20.8 %
White Wins : 4731 (35.8 %)
2021/10/09 Experimental RoundRobin tournament with 3 engines (Stockfish 211006, KomodoDragon 2.5 and KomodoDragon 2.5 MCTS), each with 5 different MultiPV-settings (1, 2, 3, 5 and 7, where 1 is the normal, default playing mode). Goal: Measure how much Elo is lost by calculating more than one PV-line, and whether Dragon 2.5 MCTS has less Elo-loss than the AlphaBeta-engines when MultiPV is 3 or higher... This is the new testrun with Stockfish 211006 (fixed time-management in MultiPV-mode) instead of Stockfish 14. The games-download includes the games and the statistics of the first testrun, too, for comparison. The ratings of the old testrun with Stockfish 14 are below, for comparison. Tournament with 3'+1'' thinking-time on an AMD Ryzen 3900 12-core (24 threads) notebook with 32GB RAM, singlethread-mode, no ponder, no bases (except for cutechess-cli, to end a game), cutechess-cli, classical, balanced 6 moves deep openings, played by humans (out of Megabase 2020, both players 2400 Elo or better). 100 rounds = 10500 games played. Same engines = same color... Download the played games here
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 211006 pv=1 : 0 13 13 1400 64.8 % -110 69.9 %
For comparison, here are the ratings with Stockfish 14 (buggy time-management in MultiPV-mode):
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 14 avx2 pv=1 : 0 13 13 1400 64.2 % -105 70.6 %
Conclusions (new testrun):
1) The MCTS-mode is very good for MultiPV-analyzing. As you can see, all 5 KomodoDragon 2.5 MCTS MultiPV-engines are in a very small range of only 9 Elo (!).
2) The Elo-difference of Stockfish 211006 and KomodoDragon 2.5 non-MCTS is: pv=1: 49 Elo / pv=2: 54 Elo / pv=3: 69 Elo / pv=5: 97 Elo / pv=7: 152 Elo. In contrast to the old testrun with SF 14, the Elo-difference of Stockfish 211006 and KomodoDragon 2.5 non-MCTS increases very clearly with a higher number of pv-lines.
2021/03/02 Huge CloneWars tournament. Stockfish 13 vs. 10 Stockfish derivatives/clones. 20000 games.
60''+600ms thinking-time, singlethread, i7-8750H 2.6GHz (Hexacore) Notebook, Windows 10 64bit, no ponder, 5 Syzygy bases for cutechess-cli - none for the engines. All engines bmi2-binary. My Unbalanced Human Openings V2.00 6moves openings were used (low draw-rate and a wider Elo-spreading than classical opening-sets!).
Program Elo + - Games Score Av.Op. Draws
1 CFish 210208 bmi2 : 3742 11 11 2000 52.7 % 3723 56.0 %
White Wins : 8750 (43.8 %)
Conclusions: All derivatives/clones play measurably weaker than Stockfish 13, except CFish - no surprise, because CFish runs a little bit faster than Stockfish and has no other changes. Note that the Unbalanced Human Openings used here spread the Elo-distances around 2x wider than a classical opening set...
Download the games here
2020/12/05 Huge experimental test (3x 7000 games) of the Eman 6.60 learning-feature.
I was curious whether (and how much) Eman 6.60 would gain by using its learning-function. Eman 6.60 writes an experience-file while it is playing. So I did 3 testruns of 7000 games each, starting with no experience and then letting Eman learn and learn. Each of the 3 testruns was 100% identical to the others, except that Eman was allowed to learn and to keep the Eman.exp-file for the next testrun. All conditions were like a normal Stockfish-testrun (see main-site): 3'+1'', singlecore, Hash 256MB, no ponder, 500 HERT openings. As you can see below, the results are a complete disappointment. The second testrun (using the experience-file of the first testrun) gave +4 Elo, the third testrun (using the experience-file of the first and the second testrun) gave no progress at all, even though the Eman.exp-file size was 72 Megabyte after the third testrun was finished...
Program Elo + - Games Score Av.Op. Draws
1 CFish 12 3xCerebellum : 3726 9 9 7000 86.1 % 3389 27.3 %
*** rest of the ratinglist deleted ***
Individual statistics:
Eman 6.60 avx2 3rd_run: 3716 7000 (+3914,=3035,- 51), 77.6 %
RubiChess 1.9dev nnue : 1000 (+803,=195,- 2), 90.0 %
RubiChess 1.9dev nnue : 1000 (+802,=196,- 2), 90.0 %
RubiChess 1.9dev nnue : 1000 (+803,=197,- 0), 90.2 %
2020/02/05 Huge experimental test-tournament (31500 games (!)) of Stockfish 11 with 7 different Contempts (-40, -24, -15, 0, +15, +24 (=default of SF 11), +40).
Thinking-time: 1'+1'', singlethread, 256 Hash, no ponder, no endgame-bases for engines (5 Syzygy for cutechess-cli). 5 human moves openings. Download all played games here
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 11 C=0 : 3558 4 4 9000 50.7 % 3553 79.3 %
Conclusions: Only the +40 and -40 Contempt results are somewhat weaker. All other Contempts are inside errorbar at the same level of strength.
2019/02/20 Testrun of the new Drawkiller balanced set (and testruns of Drawkiller tournament, Stockfish Framework 8moves and GM-4moves sets for comparison).
3 engines played a RoundRobin (Stockfish 10, Houdini 6 and Komodo 12), with 500 games in each head-to-head, so each engine played 1000 games. For each game, one opening-line was chosen at random by the LittleBlitzerGUI. Singlecore, 3'+1'', LittleBlitzerGUI, no ponder, no bases, 256 MB Hash, i7-6700HQ 2.6GHz Notebook (Skylake CPU), Windows 10 64bit
In the Drawkiller balanced sets, all endposition-evals (analyzed by Komodo) of the opening lines are in a very small interval of [-0.09;+0.09]. The idea is that this should lead to a wider Elo-spreading of the engine ratings, which makes the engine rankings much more statistically reliable (or a much lower number of played games is needed to get the results out of the errorbar-arrays). Of course, on the other hand, this concept leads to a little bit higher draw-rates... Let's see if it worked:
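The selection rule described above (keep only lines whose endposition-eval lies in [-0.09;+0.09]) can be sketched as a simple filter. The opening names and evaluations below are made-up placeholders, and `is_balanced` is a hypothetical helper, not part of any tool mentioned here:

```python
def is_balanced(eval_pawns: float, limit: float = 0.09) -> bool:
    """True if the endposition eval (in pawn units) lies in [-limit, +limit]."""
    return -limit <= eval_pawns <= limit


# Hypothetical (line name, engine eval) pairs, for illustration only
openings = [
    ("line_a", 0.05),
    ("line_b", -0.09),
    ("line_c", 0.31),
    ("line_d", -0.12),
]

# Keep only the lines that pass the eval window
balanced = [name for name, ev in openings if is_balanced(ev)]
print(balanced)
```

The interval is treated as closed here, so a line evaluated at exactly -0.09 is kept.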
Drawkiller balanced:
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3506 11 11 1000 70.9 % 3347 36.2 %
Elo-spreading (1st to last): 204 Elo
Draws: 37.9%
Drawkiller tournament:
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3494 11 11 1000 68.9 % 3353 34.2 %
Elo-spreading (1st to last): 174 Elo
Draws: 36.1%
GM_4moves:
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3475 11 11 1000 65.4 % 3363 53.2 %
Elo-spreading (1st to last): 130 Elo
Draws: 56.3%
Stockfish framework 8moves:
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3463 11 11 1000 63.0 % 3369 59.7 %
Elo-spreading (1st to last): 114 Elo
Draws: 61.3%
Conclusions:
1) The Drawkiller balanced idea was a success. The draw-rate is a little bit higher than with Drawkiller tournament (that is the price we have to pay for 2)), but look at point 2) and note that even this slightly higher draw-rate is still much, much lower than the draw-rate of any other non-Drawkiller openings set...
2) The Elo-spreading, using Drawkiller balanced, was measurably higher than with any other openings-set. That makes the engine rankings much more statistically reliable. Or a much lower number of played games is needed to get the results out of the errorbar-arrays. Example: Compared to the result of the Stockfish framework 8moves openings, the Elo-spreading of Drawkiller balanced is nearly doubled, which means you can have a doubled errorbar-array size for the same statistical reliability of the engine rankings in a tournament / ratinglist. Note that you have to play 4x more games to halve the size of an errorbar! That means, if you are using Drawkiller balanced openings, you have to play only 25%-30% of the games which you would have to play when using Stockfish Framework 8move openings for the same statistical result-quality of engine rankings (!!!) - how awesome is that?!?
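The arithmetic behind this estimate can be sketched as follows, under the simplifying assumption that the errorbar shrinks with the square root of the number of games (so 4x the games halves it) and that nothing else changes between the opening sets. `games_fraction` is a hypothetical helper name:

```python
def games_fraction(spread_new: float, spread_ref: float) -> float:
    """Fraction of games needed with the new opening set for the same
    ratio of Elo-spreading to errorbar, assuming errorbar ~ 1/sqrt(games)."""
    ratio = spread_new / spread_ref
    return 1.0 / ratio**2


# Elo-spreadings measured above: Drawkiller balanced 204, SF framework 8moves 114
frac = games_fraction(204, 114)
print(f"{100 * frac:.0f} % of the games")

# An exactly doubled spreading would give one quarter of the games
print(games_fraction(2.0, 1.0))
```

With the measured spreadings the sketch lands at roughly 31%, close to the upper end of the 25%-30% quoted above; the 25% figure corresponds to an exactly doubled spreading.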
2019/01/26 Testrun of the Skill-Levels of Stockfish 10
I made two large testruns with Stockfish 10, playing RoundRobin vs. itself with different Skill-Levels. First testrun: Level 20-10 (11000 games, 1'+1'', singlecore) Second testrun: Level 10-0 (5500 games 1'+1'', singlecore) Then both game-pools were linked together and ORDO-calculated (fixed to 3450 Elo to Stockfish 10, Level 20, which is the Elo of Stockfish 10 in the CEGT-ratinglist (40m/4', singleCPU)). Specs: Intel Quadcore-Notebook (SF 10 around 1.4 Mn/s in singlecore-mode), LittleBlitzerGUI, Stockfish Framework 8move-openings. No ponder, no bases. 256MB Hash per engine.
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 (100%) : 3450 47 47 2000 98.5 % 2601 2.8 %
(Stockfish 10: 3450 Elo is the CEGT-rating at 40m/4'.) The percent-numbers in brackets are the values of the "strength-meter" in the Droidfish-App for smartphones...
2019/01/06 One of the biggest opening-sets testings of all time!
8 opening-sets were tested: Drawkiller tournament, SALC V5, Noomen (TCEC openings Season 9-13 Superfinal and Gambit-openings), Stockfish Framework 2-moves and 8-moves openings, 4 GM moves (out of MegaBase 2018, checked with Komodo), the HERT set by Thomas Zipproth and FEOBOS v20.1 contempt 3 (using contempt 3 openings is recommended by the author, Frank Quisinsky). 7 engines played a 2100-game RoundRobin tournament with each opening-set (not one opening-set playing vs. another opening-set!). For each game, one opening-line was chosen at random by the GUI. The 7 engines were: Stockfish 10, Houdini 6, Komodo 12, Fire 7.1, Ethereal 11.12, Komodo 12.2.2 MCTS, Shredder 13. 100 games were played in each head-to-head competition, so in each round-robin, each engine played 600 games. Singlecore, 3'+1'', LittleBlitzerGUI, no ponder, no bases, 256 MB Hash, i7-6700HQ 2.6GHz Notebook (Skylake CPU), Windows 10 64bit. 3 games running in parallel, each testrun took 3-4 days, depending on the average game-duration. Draw adjudication after 130 moves played by the engines (after finishing the opening-line)
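The game counts of such a RoundRobin follow from the number of engine pairings. A minimal sketch (the function name is hypothetical) that reproduces the figures given above:

```python
from math import comb


def round_robin_games(engines: int, games_per_pairing: int) -> tuple[int, int]:
    """Return (total games, games per engine) for a full round-robin."""
    total = comb(engines, 2) * games_per_pairing
    per_engine = (engines - 1) * games_per_pairing
    return total, per_engine


# 7 engines, 100 games in each head-to-head
print(round_robin_games(7, 100))  # (2100, 600)
```

Seven engines form 21 pairings, so 100 games per pairing give the 2100 games per opening-set and the 600 games per engine stated above.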
First of all the main question: Why are low draw-rates and wide Elo-spreadings of engine testing-results better? You find the answer here
This excellent experiment of Andreas Strangmueller shows without any doubt that the more thinking-time (or faster hardware, that is the same!) computerchess gets, the more the draw-rates climb and the more the Elo-spreadings shrink. So, it is only a question of time until the draw-rates get so high and the Elo-spreadings of testing-results get so small that engine-testing or engine-tournaments will no longer give any valuable results, because the Elo-differences of results will always stay inside the errorbars, even with thousands of played games. So, it is absolutely necessary to lower the draw-rates and raise the Elo-spreadings, if computerchess shall survive the next decades! Therefore the following conclusions of this huge experiment with different opening-sets:
1) The Drawkiller openings are a breakthrough into another dimension of engine-testing: The overall draw-rate (27%) is nearly halved, compared to classical openings sets (FEOBOS (51.3%), Stockfish Framework 8moves openings (51.9%)) AND the Elo-spreading is around +150 Elo better (!!), so the rankings are much more stable and reliable, because the errorbars of all results are nearly the same in all testruns. And the average game-duration, using Drawkiller, was 11.5% lower, than using a classical opening-set. So, in the same time, you can play more than +10% games on the same machine, which improves the quality of the results, too, because the errorbars get smaller with more played games. Download the future of computerchess (the Drawkiller openings): here
2) The order of rank of the engines is exactly the same in all mini-ratinglists generated by ORDO out of these testruns. So, what we learn here is that it does not matter whether an opening-set contains all ECO-codes (FEOBOS does!) or not (Drawkiller and SALC V5 definitely do not!). The order of rank of engines in a ratinglist is exactly the same! So, the over and over repeated statement of many people, that using all or the most played ECO-codes (by human players) in an opening-set is important for engine-testing, because otherwise the results are distorted, is a FAIRY TALE and nothing else !!!
3) At the bottom, I added the CEGT and CCRL ratinglists with the same engines which were used for this project (nearly the same versions (Ethereal 11 instead of Ethereal 11.12, for example)). There you can see that the ranking in these ratinglists is exactly the same, too. So, what we learn here is that the over and over repeated statement of many people, that it is necessary to test engines versus a lot of opponents for a valid rating/ranking, is a FAIRY TALE, too: 6 opponents gave the same ranking-results in all testruns of this project as in CEGT and CCRL with much, much more opponents.
4) The FEOBOS-project was a complete waste of time and resources. It took more than one year of work and calculations, but the results are not measurably better than the results of the Stockfish Framework 8moves openings: The overall draw-rate is 0.6% better (=nothing). The Elo-spreading is +21 Elo better (=nearly nothing). And the prime-target of FEOBOS was to avoid early draws: The number of early draws until move 10 (after leaving the opening-line) is 0, but with Stockfish Framework 8moves openings, there are only 6 games drawn within 10 moves, out of 2100 games (=0.29%). And in the number of early draws until move 20 and move 30, FEOBOS is slightly worse than the Stockfish Framework 8moves openings. So, even the prime-target of FEOBOS failed.
5) The Noomen openings (the openings which lower the draw-rate in the TCEC superfinals, plus the Noomen gambit-lines) lowered the overall draw-rate (43%) compared to classical openings sets (FEOBOS (51.3%), Stockfish Framework 8moves openings (51.9%)), but not very much (compared to Drawkiller (27%)). And the Elo-spreading is only a little bit better than the Elo-spreading of the classical opening-sets. So, these Noomen-openings are a little improvement, but not more. And the number of openings is very, very small (only 477 lines): too small for building an opening book. Drawkiller tournament contains 6848 lines and can be used as opening-book, of course.
Draws : 566 (27.0 %)
Draws : 829 (39.5 %)
Draws : 902 (43.0 %)
Draws : 929 (44.2 %)
Draws : 975 (46.4 %)
Draws : 1013 (48.2 %)
Draws : 1077 (51.3 %)
Draws : 1090 (51.9 %)
Long summary (with ratinglists):
Drawkiller tournament:
Avg game length = 389.777 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3459 23 23 600 82.6 % 3157 21.8 %
Elo-spreading: from first to last: 448 Elo
Number of early draws:
Games : 2100 (finished)
SALC V5:
Avg game length = 399.781 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3404 21 21 600 78.3 % 3166 32.8 %
Elo-spreading: from first to last: 341 Elo
Number of early draws:
Games : 2100 (finished)
Noomen (TCEC openings Season 9-13 Superfinal and Gambit-openings (477 lines)):
Avg game length = 405.223 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3388 20 20 600 76.8 % 3169 39.7 %
Elo-spreading: from first to last: 312 Elo
Number of early draws:
Games : 2100 (finished)
Stockfish Framework 2moves openings:
Avg game length = 430.108 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3395 20 20 600 77.5 % 3168 35.0 %
Elo-spreading: from first to last: 333 Elo
Number of early draws:
Games : 2100 (finished)
4 GM moves (out of MegaBase 2018, checked with Komodo):
Avg game length = 449.414 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3396 20 20 600 77.5 % 3167 37.3 %
Elo-spreading: from first to last: 330 Elo
Number of early draws:
Games : 2100 (finished)
HERT set (500 pos):
Avg game length = 442.339 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3384 20 20 600 76.3 % 3169 42.2 %
Elo-spreading: from first to last: 316 Elo
Number of early draws:
Games : 2100 (finished)
FEOBOS v20.1 contempt 3:
Avg game length = 437.481 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3365 19 19 600 73.9 % 3173 45.5 %
Elo-spreading: from first to last: 302 Elo
Number of early draws:
Games : 2100 (finished)
Stockfish Framework 8moves openings:
Avg game length = 438.899 sec
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 10 bmi2 : 3363 19 19 600 73.9 % 3173 44.8 %
Elo-spreading: from first to last: 281 Elo
Number of early draws:
Games : 2100 (finished)
For comparison:
CEGT 40/4 ratinglist (singlecore):
1 Stockfish 10.0 x64 1CPU 3450
Elo-spreading: from first to last: 298 Elo
CCRL 40/4 ratinglist (singlecore):
1 Stockfish 10 64-bit 3498
Elo-spreading: from first to last: 229 Elo by bayeselo. (With ORDO: 276 Elo)
2017/10/22 Some days ago, I had the idea to filter half-closed positions out of my SALC V3 opening-set. This means that in the endpositions of the opening-lines, the following conditions had to be true:
1) On the d-file or e-file there is at least one white and one black pawn (=one of the two center-files is closed)
The idea is that in these positions, the probability of many quick capturing-moves is much lower, so it should take...
I did a testrun with these positions (with the exact same testing-conditions as the experimental testruns of the experiment below from 2017/10/17, so the results are comparable). The result is really surprising and much better than I expected.
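Condition 1) can be checked directly on the FEN of an opening endposition. The following is only a sketch of that single condition (the function names are hypothetical, and any further conditions of the original filter are not reproduced here):

```python
def pawns_on_file(fen: str, file_letter: str) -> tuple[bool, bool]:
    """Return (white pawn present, black pawn present) for one file."""
    col = "abcdefgh".index(file_letter)
    board = fen.split()[0]  # piece placement is the first FEN field
    white = black = False
    for rank in board.split("/"):
        squares = []
        for ch in rank:
            if ch.isdigit():
                squares.extend("." * int(ch))  # expand empty squares
            else:
                squares.append(ch)
        if squares[col] == "P":
            white = True
        elif squares[col] == "p":
            black = True
    return white, black


def has_closed_center_file(fen: str) -> bool:
    """True if the d-file or the e-file holds both a white and a black pawn."""
    return any(all(pawns_on_file(fen, f)) for f in "de")


# Both center pawns exchanged: neither file is closed
open_center = "rnbqkbnr/ppp2ppp/8/8/8/8/PPP2PPP/RNBQKBNR w KQkq - 0 1"
print(has_closed_center_file(open_center))
```

Only the piece-placement field of the FEN is used, so the check works the same for EPD lines, which share that field.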
Games Completed = 1000 of 1000 (Avg game length = 920.236 sec)
The result is impressive. The draw-rate (48.8%) is more than 5 percentage points lower than with "normal" SALC (53.9%) and 14.6 percentage points lower than with the Stockfish-Framework 8move-openingset (63.4%) (this means the number of draws is 23% lower with SALC half-closed!!!) - that is a huge step forward on my mission to save computerchess from draw-death! And the Elo-differences of the engine-scores are not getting smaller (which would happen if the opening-positions had huge advantages for white or black); they are getting bigger (score of asmFish vs Komodo with Standard-openings: 60.3% and with SALC half-closed: 63.1%)! And take a look at the average game length: 1036 sec with Standard-openings and only 920 sec with SALC half-closed: you need 11.2% less time, using SALC half-closed, for the same number of games. This testrun (3 games played simultaneously) ran nearly half a day shorter than the testrun with Standard-openings.
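The two relative figures in this paragraph (draws and game time) are the same kind of calculation: the difference divided by the reference value. A small sketch with a hypothetical helper name:

```python
def relative_reduction(reference: float, value: float) -> float:
    """Reduction of value relative to reference, in percent."""
    return (reference - value) / reference * 100.0


# Draw-rates: Stockfish-Framework 8move set vs. SALC half-closed
draw_cut = relative_reduction(63.4, 48.8)
# Average game lengths in seconds: Standard-openings vs. SALC half-closed
time_cut = relative_reduction(1036.164, 920.236)

print(f"draws: {draw_cut:.1f} % fewer, time: {time_cut:.1f} % less")
```

This separates the difference in percentage points (63.4 - 48.8 = 14.6) from the relative reduction in the number of draws (about 23%).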
2017/10/17 After the release of the FEOBOS v10 opening-books and files (by Frank Quisinsky and Klaus Wlotzka), with the new "contempt-books/opening-sets", I was curious to see, if the opening-set with the highest contempt 5 (means, that none of the 10 analyzing engines had a 0.00 evaluation in any opening-line endposition), could lower the draw-rate in engine-testing (compared to my SALC-openings and the standard 8-move opening-set of the Stockfish-framework). In May, 2017, I did 3 huge testruns (using SALC, Stockfish-opening-set and FEOBOS v3 beta). Now, I tested FEOBOS v10 Contempt 5 opening-set with the exact same conditions, so there was no need to replay the testruns, using SALC and using Stockfish-framework openings...(scroll down here, to find the 3 experimental testruns in 2017/05/19).
asmFish played 1000 games versus Komodo 10.4 with all 3 books/opening sets (=3000 games). Not bullet-speed, but 5'+3'' (!), singlecore, 256 MB Hash, no pondering, both engines with Contempt=+15. LittleBlitzerGUI (in RoundRobin playmode, in which for each game one opening position is chosen at random out of an epd-openings file).
Games Completed = 1000 of 1000 (Avg game length = 1026.116 sec)
Settings = RR/256MB/300000ms+3000ms/M 450cp for 4 moves, D 120 moves/EPD:C:\LittleBlitzer\SALC_V2_10moves.epd(10000)
1. asmFish 170426 x64 620.5/1000 351-110-draws: 539 (L: m=0 t=0 i=0 a=110) (D: r=149 i=231 f=38 s=0 a=121) (tpm=6659.0 d=30.93 nps=2552099)
2. Komodo 10.4 x64 379.5/1000 110-351-539 (L: m=0 t=0 i=0 a=351) (D: r=149 i=231 f=38 s=0 a=121) (tpm=6920.9 d=26.71 nps=1619591)
Games Completed = 1000 of 1000 (Avg game length = 1036.164 sec)
Settings = RR/256MB/300000ms+3000ms/M 450cp for 4 moves, D 120 moves/EPD:C:\LittleBlitzer3\34700_ok.epd(32000)
1. asmFish 170426 x64 603.0/1000 286-80-draws: 634 (L: m=0 t=0 i=0 a=80) (D: r=148 i=232 f=39 s=1 a=214) (tpm=6334.2 d=31.54 nps=2570164)
2. Komodo 10.4 x64 397.0/1000 80-286-634 (L: m=0 t=2 i=0 a=284) (D: r=148 i=232 f=39 s=1 a=214) (tpm=6473.6 d=27.00 nps=1614400)
Conclusions: The FEOBOS v10 Contempt 5 positions lowered the draw-rate compared to the Stockfish opening-set from 63.4% to 60.0% and the number of 3fold-draws from 14.8% to 13.9%. That is small, but measurable progress. It is still far away from the low draw-rate of the SALC-openings (SALC lowered the draw-rate from 63.4% to 53.9% (!)), but note that FEOBOS plays a wide variety of all openings, while SALC plays only lines where white and black castled to opposite sides of the chessboard.
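The draw-rates quoted here can be read straight off the LittleBlitzer result lines. The sketch below pulls the counters out of the "(D: ...)" group of such a line; the function name is hypothetical, and the only counter it interprets is "r", which is read as the 3fold-repetition draws because 148 of 1000 games matches the 14.8% quoted above:

```python
import re


def draw_breakdown(result_line: str) -> dict[str, int]:
    """Extract the counters of the '(D: ...)' group from a result line."""
    match = re.search(r"\(D:([^)]*)\)", result_line)
    if match is None:
        raise ValueError("no draw breakdown found")
    return {key: int(val) for key, val in re.findall(r"(\w+)=(\d+)", match.group(1))}


# Result line of the Stockfish-framework openings testrun (shortened)
line = ("1. asmFish 170426 x64 603.0/1000 286-80-draws: 634 "
        "(L: m=0 t=0 i=0 a=80) (D: r=148 i=232 f=39 s=1 a=214)")
games = 1000

counts = draw_breakdown(line)
total_draws = sum(counts.values())
print(f"draw-rate: {100 * total_draws / games:.1f} %")
print(f"3fold-draws: {100 * counts['r'] / games:.1f} %")
```

The counters of the group add up to the 634 draws reported in the same line, which is a handy consistency check when copying results by hand.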
2017/09/01 I measured the speed of Stockfish-compiles: abrok, ultimaiq and BrainFish (without the Cerebellum-Library, BrainFish is identical to Stockfish). Stockfish C++ code from 170905, measured with fishbench (10 runs each version), i7-6700HQ 2.6 GHz Skylake CPU. These are the results:
abrok modern : 1.557 mn/s
ultimaiq modern : 1.660 mn/s
brainfish modern: 1.729 mn/s
bmi2:
So, the ultimaiq-compiles are around 6% faster than the abrok-compiles, but BrainFish is around 10% faster than abrok!!! From now on, I will use the BrainFish-compiles (without Cerebellum-Library) for my Stockfish-testruns, because these are the fastest compiles at the moment and the results are better comparable with the BrainFish-testruns, when BrainFish uses the Cerebellum-Library.
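For reference, the speed-ups can be recomputed from the "modern" figures listed above. This is just the ratio of the node rates; the helper name is hypothetical:

```python
# Measured speeds of the "modern" compiles in million nodes per second
speeds = {"abrok": 1.557, "ultimaiq": 1.660, "brainfish": 1.729}


def speedup_percent(candidate: float, baseline: float) -> float:
    """How much faster candidate is than baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


for name in ("ultimaiq", "brainfish"):
    gain = speedup_percent(speeds[name], speeds["abrok"])
    print(f"{name}: {gain:+.1f} % vs. abrok")
```

For these three numbers the ratios work out to about 6.6% and 11.0%, in line with the rounded "around 6%" and "around 10%" in the text.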
2017/08/23 Using the new HERT openings-set (by Thomas Zipproth) for my Stockfish-testing was a great opportunity to compare the gamebases played with HERT (contains positions selected from the most played variations in Engine and Human tournaments) and played with my SALC openings (SALC means: only positions with castling to opposite sides, both queens still on board. The idea was to lower the draw-rate in computerchess and make the games more tactical and thrilling, without distorting the results of engine-tests and engine-tournaments). So, here the results. Both gamebases were played with 3'+1'', singlecore, 512 MB Hash. The only difference was the opening-set (HERT / SALC)...
HERT:
1 Stockfish 170526 bmi2 : 3346 7 7 5000 71.3 % 3171 45.6 %
Elo-differences: 1-2: 32
average game length: +13.7% compared to SALC games (moves)
White Wins : 5129 (34.2 %)
SALC:
1 Stockfish 170526 bmi2 : 3359 7 7 5000 72.7 % 3168 39.9 %
Elo-differences: 1-2: 32
White Wins : 5476 (36.5 %)
Conclusions:
1) SALC lowers the draw-rate a lot (35.8%), compared to the HERT openings-set (42.8%) - note that the HERT-set was optimized for a low draw-rate. Thomas Zipproth has chosen only lines which were not too drawish. Using other "classical" openings-sets should lead to a higher draw-rate than using HERT !!!
2) The order of rank is the same for all engines in both gamebases - no distorted results with SALC.
5) At the moment, using a classical openings-set (like HERT) or book is OK, when playing with engines with a huge Elo-difference and when using a short thinking-time. But if you play only with very strong engines (with a small Elo-difference) and / or with very long thinking-time, then using SALC is strongly recommended, because in those cases, the draw-rate increases a lot. And in the future, when the hardware gets faster and faster, the draw-rate of computerchess will increase more and more (of course). Then using SALC will be the only solution to prevent the "draw-death" of computerchess... So, do not hesitate and download the complete SALC-package (opening-books and more than 12000 opening-positions (PGN and EPD)) here
2017/05/19 At the end of 2016, I released the 2.0 version of my SALC opening book. The idea was to create a book which lowers the draw-rate in computerchess, because the draw-rate increases more and more as the engines get stronger and the hardware gets faster. In online engine-tournaments and in the TCEC-tournament, the draw-rates are already around 85%, and so the “draw-death“ of computerchess is coming closer and closer. As you can see below (experimental testruns 2016/12/09), my SALC V2.0 book lowered the draw-rate a lot in a Stockfish 8 selfplay testrun (compared to a classic opening book/position set) (from 83% to 68.2% (!)). But in the last months, some people criticized that the openings in the SALC book just give a huge advantage to one color, which lowers the number of draws. It is clear that this way of creating a book would work: if all lines of a book gave one color an advantage of +9, the draw-rate would be (of course...) 0%... But on the other hand, the scores in an engine-tournament using such a book would be 50% for all engines, because we would have a random distribution of the advantage of the opening lines, if the number of played games is high enough.
Settings = RR/256MB/300000ms+3000ms/M 450cp for 4 moves, D 120 moves/EPD:C:\LittleBlitzer\SALC_V2_10moves.epd(10000)
Time = 945199 sec elapsed, 0 sec remaining
1. asmFish 170426 x64 620.5/1000 351-110-draws: 539 (L: m=0 t=0 i=0 a=110) (D: r=149 i=231 f=38 s=0 a=121) (tpm=6659.0 d=30.93 nps=2552099)
2. Komodo 10.4 x64 379.5/1000 110-351-539 (L: m=0 t=0 i=0 a=351) (D: r=149 i=231 f=38 s=0 a=121) (tpm=6920.9 d=26.71 nps=1619591)
Games Completed = 1000 of 1000 (Avg game length = 1049.395 sec)
Settings = RR/256MB/300000ms+3000ms/M 450cp for 4 moves, D 120 moves/EPD:C:\LittleBlitzer2\FEOBOS_v03+.epd(24085)
Time = 1039157 sec elapsed, 0 sec remaining
1. asmFish 170426 x64 601.5/1000 293-90-draws: 617 (L: m=0 t=0 i=0 a=90) (D: r=132 i=221 f=38 s=1 a=225) (tpm=6315.9 d=30.83 nps=2477078)
2. Komodo 10.4 x64 398.5/1000 90-293-617 (L: m=0 t=0 i=0 a=293) (D: r=132 i=221 f=38 s=1 a=225) (tpm=6424.5 d=26.49 nps=1583220)
Games Completed = 1000 of 1000 (Avg game length = 1036.164 sec)
Settings = RR/256MB/300000ms+3000ms/M 450cp for 4 moves, D 120 moves/EPD:C:\LittleBlitzer3\34700_ok.epd(32000)
Time = 1036719 sec elapsed, 0 sec remaining
1. asmFish 170426 x64 603.0/1000 286-80-draws: 634 (L: m=0 t=0 i=0 a=80) (D: r=148 i=232 f=39 s=1 a=214) (tpm=6334.2 d=31.54 nps=2570164)
2. Komodo 10.4 x64 397.0/1000 80-286-634 (L: m=0 t=2 i=0 a=284) (D: r=148 i=232 f=39 s=1 a=214) (tpm=6473.6 d=27.00 nps=1614400)
2016/12/09: Some weeks ago, I created my SALC opening book for engine-engine matches. The lines were created out of 10000 human-games; all lines are 20 plies deep and were checked with Komodo 10.2 (20'' per position, running on 3 cores), with an evaluation inside of [-0.6,+0.6]. In all lines, white and black castled to opposite sides, with both queens still on the board. The idea is to get more attacks on the king and a lower draw rate, because the draw rate in computerchess increases more and more, the stronger the engines and the faster the hardware get. For my Stockfish bullet-testruns, I have been using 500 SALC-positions since 2014, which lowered the draw rate a lot. To verify how much the draw rate is lowered by this new book / opening-positions set, I did two testruns of 3000 games each (=6000 games). Stockfish 8 in selfplay. 70''+700ms thinking-time, singlecore, LittleBlitzerGUI (using the 10000 positions epd-files, playing in RoundRobin-mode, in which for each game one epd-position is chosen at random).
I think the result is really impressive...
2016/03/12: Testrun of 3 new Stockfish-clones. Stockfish played 1000 games against each of them (LittleBlitzerGUI, singlecore, 70''+700ms, 128 Hash, no ponder, no bases, no largepages, 500 SALC-openings). None of the clones is stronger (no surprise), so don't waste your time with these "engines". The new popcount-versions of DON are not running on my system (and the LittleBlitzerGUI), so I could not test DON.
Program Elo + - Games Score Av.Op. Draws
1 Stockfish 160302 x64 : 3300 7 7 3000 51.9 % 3287 65.0 %
2015/02/20: A little "Clone-Wars" testrun of Stockfish 6 against 5 of its clone-engines. (70''+700ms, singlecore, SALC-openings, 1000 games-Gauntlet). As you can see, none of the 5 clones is really measurably stronger (all results in a +/-1% score-interval and clearly inside the errorbar).
Program Elo + - Games Score Av.Op. Draws
1 Pepper 150213 x64s : 3251 13 13 1000 51.0 % 3243 60.9 %