AI can play chess well without using a search tree for future moves
[Paper: Grandmaster-Level Chess Without Search]
Summary by Oreolorun Olu-Ipinlaye
There’s a longstanding relationship between chess and AI. Chess played a key role in demonstrating that a machine could outplay the best human players when Deep Blue, IBM’s chess engine, defeated world champion Garry Kasparov in 1997. Since then, there have been several state-of-the-art chess engines, such as AlphaZero and Stockfish, all of which are powered by search.
What exactly does “search” mean in the context of chess engines? If you play chess, you already know that experienced players can think several moves ahead: “If I play this, my opponent will most likely play this, which will then let me play that.” That’s exactly what search lets these chess engines do: explore sequences of future moves and countermoves before committing to one.
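To make “search” concrete, here’s a minimal, hypothetical sketch of depth-limited negamax, the idea behind classic engines (real engines like Stockfish add alpha-beta pruning, far better evaluation functions, and many other tricks). It uses the python-chess library; the piece values and search depth are illustrative choices of mine, not anything from the paper.

```python
# Toy illustration of "search": look a few plies ahead before choosing a move.
# Piece values and depth are illustrative; a real engine's evaluation is far richer.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    """Crude material count from the perspective of the side to move."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, board.turn))
        score -= value * len(board.pieces(piece_type, not board.turn))
    return score

def negamax(board: chess.Board, depth: int) -> float:
    """Depth-limited negamax: 'if I play this, my opponent will play that...'"""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -float("inf")
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def search_best_move(board: chess.Board, depth: int = 3) -> chess.Move:
    best_move, best_score = None, -float("inf")
    for move in board.legal_moves:
        board.push(move)
        score = -negamax(board, depth - 1)
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```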
Novice players (like myself) and amateurs, on the other hand, play chess “in the moment,” one move at a time. Interestingly, several chess champions are purported to have said that they only think one move ahead; hmm, curious. With the vast amount of data at our disposal these days and the availability of large models, imagine if we could train a chess agent that only thinks one move ahead. Would it be as strong as the current state-of-the-art chess engines? Does training a model with tons of parameters on a large dataset make a search component unnecessary? This is precisely what this week’s paper investigates.
The researchers trained their chess engine differently than is typical. They trained three models, each with a different prediction target (sketched in the code below):
Action value: the probability of winning after making a given move in a given position
State value: the probability of winning from a given board position (at play time, evaluated on the positions an opponent is left in after each candidate move)
Behavioral cloning: the move that the oracle engine (Stockfish, described below) would play in a given position
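To make the three targets concrete, here is a tiny, hypothetical illustration of the (input, target) pairs each model learns from; the FEN placeholder and the probability numbers are mine, not values from the paper.

```python
# Hypothetical (input, target) pairs for the three predictors; the FEN string
# and probabilities below are illustrative placeholders, not data from the paper.

# Action value: (board state, candidate move) -> probability of winning
action_value_example = {"input": ("<FEN of position>", "e2e4"), "target": 0.53}

# State value: board state -> probability of winning from that position
state_value_example = {"input": "<FEN of position>", "target": 0.52}

# Behavioral cloning: board state -> the move the oracle (Stockfish) would play
behavioral_cloning_example = {"input": "<FEN of position>", "target": "e2e4"}
```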
They trained these models using standard supervised learning, which is notable because learning-based chess engines such as AlphaZero are typically trained with reinforcement learning and self-play.
For brevity's sake, I'll focus on the action-value model as the researchers found it to be ~6% better than the state-value model and ~13% better than the behavioral-cloning model. To train the action-value model, the authors curated a large dataset of chess games in an unconventional manner. Rather than hand-crafting features, they generated data in a somewhat automated fashion using a chess engine. They downloaded ten million games from Lichess (an online chess platform), and then extracted chess states (information about which pieces are where, whose turn it is to play, etc.) from these games. Finally, they used Stockfish 16 to estimate action-values for every legal action at a given state. Using this strategy, they were able to amass 15.32B action-value estimates to use as the training set. They curated the test set in a similar manner but with fewer games and a different time frame.
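Here’s a minimal sketch of how such action-value labels could be generated with python-chess and a Stockfish binary. The time limit per move, the Stockfish path, and the centipawn-to-win-probability mapping (the logistic formula Lichess publishes) are assumptions on my part, not the paper’s exact settings.

```python
# Sketch: label every legal move in a position with a win-probability estimate
# from Stockfish. Time limit, Stockfish path, and the centipawn->win% mapping
# are my assumptions, not the paper's exact settings.
import math
import chess
import chess.engine

def centipawns_to_win_prob(cp: int) -> float:
    # Logistic centipawn-to-win-probability mapping (coefficient from Lichess).
    return 0.5 + 0.5 * (2 / (1 + math.exp(-0.00368208 * cp)) - 1)

def action_values(fen: str, engine: chess.engine.SimpleEngine,
                  time_limit: float = 0.05) -> dict[str, float]:
    board = chess.Board(fen)
    values = {}
    for move in board.legal_moves:
        board.push(move)
        info = engine.analyse(board, chess.engine.Limit(time=time_limit))
        # Score of the resulting position from the mover's point of view.
        cp = info["score"].pov(not board.turn).score(mate_score=10_000)
        values[move.uci()] = centipawns_to_win_prob(cp)
        board.pop()
    return values

if __name__ == "__main__":
    # Path to the Stockfish binary is an assumption; adjust for your system.
    engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
    try:
        print(action_values(chess.Board().fen(), engine))
    finally:
        engine.quit()
```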
They also did something interesting regarding the data representation they fed into the model. They represented board states in FEN (Forsyth-Edwards Notation) format as fixed-length ASCII strings of length 77, padded with periods where required. They then tokenized these FEN strings by treating each character as a unique token. They stored actions in UCI (Universal Chess Interface) notation (e.g., e2e4 represents the popular opening move of pushing a pawn from square e2 to e4). To tokenize actions, they enumerated every UCI move that can legally occur in chess, sorted the list alphabetically, and used a move’s index in that list as its token. Finally, they concatenated the tokenized board state and action to form the model’s input during training, a deviation from the more typical practice of feeding chess models hand-designed numerical board representations.
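Here’s a rough sketch of that tokenization scheme as I understand it; the character vocabulary and the way the move list is supplied are simplified stand-ins for the paper’s exact setup.

```python
# Sketch of the tokenization described above: pad the FEN to 77 characters,
# tokenize it character by character, and map a UCI move to its index in a
# sorted move list. Vocabulary construction here is a simplification.
FEN_LENGTH = 77
PAD_CHAR = "."

# Character vocabulary covering FEN strings (pieces, digits, separators, pad).
FEN_CHARS = sorted(set("pnbrqkPNBRQK0123456789/- abcdefghw" + PAD_CHAR))
CHAR_TO_ID = {ch: i for i, ch in enumerate(FEN_CHARS)}

def tokenize_fen(fen: str) -> list[int]:
    padded = fen.ljust(FEN_LENGTH, PAD_CHAR)  # fixed length 77, '.'-padded
    return [CHAR_TO_ID[ch] for ch in padded]

def tokenize_action(uci_move: str, all_uci_moves: list[str]) -> int:
    # all_uci_moves is assumed to be the precomputed list of every UCI move
    # that can occur in chess; the token is the move's index in sorted order.
    return sorted(all_uci_moves).index(uci_move)

def model_input(fen: str, uci_move: str, all_uci_moves: list[str]) -> list[int]:
    # Concatenate the tokenized state and the action token, as described above.
    return tokenize_fen(fen) + [tokenize_action(uci_move, all_uci_moves)]
```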
The authors of this paper derived model targets by grouping action values into K discrete bins (K = 128), which structures the task as a classification problem. During training, they fed the tokenized state-action pairs as inputs, with the binned action values as targets, to a decoder-only transformer model with a context size of 79 and an output size of K. After training, they obtained a chess model which, when shown a board state and all legal actions, predicts a win probability for each legal move, one at a time, then picks the move with the highest predicted win probability (the argmax). Their training setup essentially distilled the knowledge of Stockfish 16 (termed an oracle) into a transformer model.
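A rough sketch of the binning and of the move-selection loop at play time is below; the bin-to-value conversion and the `model` and `tokenize` placeholders are my own illustrative stand-ins for the trained transformer and the tokenizer sketched earlier.

```python
# Sketch of turning win probabilities into K = 128 class labels, and of picking
# a move at play time by scoring every legal action. `model` and `tokenize` are
# placeholders for the trained transformer and the tokenizer sketched above.
import chess

K = 128

def value_to_bin(win_prob: float) -> int:
    # Discretize a win probability in [0, 1] into one of K uniform bins.
    return min(int(win_prob * K), K - 1)

def bin_to_value(bin_index: int) -> float:
    # Represent each bin by its midpoint.
    return (bin_index + 0.5) / K

def choose_move(board: chess.Board, model, tokenize) -> chess.Move:
    best_move, best_value = None, -1.0
    for move in board.legal_moves:
        tokens = tokenize(board.fen(), move.uci())   # state-action tokens
        bin_probs = model(tokens)                    # length-K distribution over bins
        value = bin_to_value(max(range(K), key=lambda i: bin_probs[i]))
        if value > best_value:
            best_move, best_value = move, value
    return best_move
```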
The authors trained three action-value models of different parameter sizes (9M, 136M, and 270M). Overall, the 270M-parameter model performed best. While playing against bots on Lichess, it had an Elo rating of 2299, outperforming GPT-3.5-turbo-instruct (1755); while playing against humans, it had an Elo rating of 2895! Basically, it played at grandmaster level.
Via ablation experiments, the authors also noted that training with more data led to better model performance. Their findings imply that, with a large attention-based model and a vast amount of data, you can model complex tasks in a supervised-learning setup, tasks that previously required some combination of reinforcement learning, search, and hand-crafted heuristics. This also shows that transformer models can do more than next-token prediction; they can be applied to other tasks like classification. While it’s quite impressive to see this model play chess at grandmaster level without analyzing several moves ahead, keep in mind that it was trained for, and performed at that level in, blitz chess (a fast format that allows little thinking time). In blitz, there really isn’t time to calculate several moves ahead, so a chess engine without a search component may be particularly well suited to this format.
The authors don’t claim that their model is perfect. They note that the predictors are often indecisive in the face of overwhelming victory (where more than one move clearly wins), since many actions can end up in the maximum value bin. This causes the chess agent to pick among those moves more or less at random rather than commit to the line that leads to checkmate in the fewest moves. One critique of the paper highlighted this issue, among others, as a potential red flag and argued that the agent can’t be said to play at a grandmaster level if its endgame is weak.
Personally, I feel the findings in this paper are quite impressive, though perhaps the researchers could have gotten better results by structuring the modeling task as a regression problem rather than a classification problem. But the fact that they attained grandmaster-level chess without using search is a big milestone, one that could potentially lead to chess models that are computationally simpler than the current state-of-the-art ones.