Paper: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Summary by Adrian Wilkins-Caruana
Some kinds of problems are hard to solve in your head, like 768 × 364. For problems like this, it helps to break the problem down into many smaller problems and write down intermediate answers, like a multiplication algorithm does. LLMs are much the same: If we ask GPT-4 to complete the sequence “768 × 364 = ”, chances are it’ll get the answer wrong. But, like people, LLMs do better when we let them take their time, such as by instructing them to “think step by step.” Today’s paper is about a new way of getting LLMs to think using a “tree of thoughts.”
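To see why writing down intermediate answers helps, consider how long multiplication splits the product into partial results that are each easy on their own: 768 × 364 = 768 × 300 + 768 × 60 + 768 × 4 = 230,400 + 46,080 + 3,072 = 279,552.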
The tree of thoughts (ToT) method is one of several ways that we can try to make LLMs better at solving some kinds of tasks, such as creative writing or crossword puzzles. Other methods, each sketched in code after this list, include:
Input-output (IO) prompting (asking the LLM to produce the answer directly from the input, with no intermediate reasoning),
Chain of thought (CoT) prompting (asking the LLM to think step by step), and
Self-consistency (taking a majority vote over several chain-of-thought samples).
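To make these baselines concrete, here’s a minimal Python sketch. It assumes an `llm` callable that maps a prompt string to a sampled completion (any chat or completion API would do); the prompt wording and helper names are placeholders of mine, not the paper’s exact templates:

```python
from collections import Counter
from typing import Callable

def extract_final_answer(response: str) -> str:
    # Crude heuristic: treat the last non-empty line as the final answer.
    return response.strip().splitlines()[-1]

def solve_io(llm: Callable[[str], str], question: str) -> str:
    # IO prompting: ask for the answer directly, with no intermediate reasoning.
    return llm(f"Q: {question}\nA:")

def solve_cot(llm: Callable[[str], str], question: str) -> str:
    # CoT prompting: nudge the model to reason before answering.
    return llm(f"Q: {question}\nA: Let's think step by step.")

def solve_self_consistency(llm: Callable[[str], str], question: str, n: int = 10) -> str:
    # Self-consistency: sample n CoT completions (llm should sample with
    # temperature > 0) and return the most common final answer.
    answers = [extract_final_answer(solve_cot(llm, question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```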
The image below compares these methods to the ToT method.
In the ToT approach, the intermediate problem-solving steps are progressively evaluated and selectively expanded upon. Take the input problem, “Use each of the numbers ‘4 9 10 13’ exactly once, with the operations ‘+ - × ÷’, to reach a result of 24” (the researchers call this task Game of 24). The ToT method uses an LLM to both propose new thoughts and evaluate how likely each thought is to lead to a correct solution. The image below shows how the LLM proposes and evaluates intermediate thoughts in Game of 24. (The proposal and scoring prompts shown below are specific to Game of 24; other tasks require their own unique prompts.)
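At its core, the paper’s main search procedure is a breadth-first search over chains of partial thoughts. Here’s a minimal sketch of that loop, where `propose` and `evaluate` stand in for the task-specific proposal and scoring prompts just described; the function and parameter names are mine, not the paper’s:

```python
from typing import Callable, List

def tree_of_thoughts_bfs(
    propose: Callable[[str], List[str]],   # LLM call: partial solution -> candidate next thoughts
    evaluate: Callable[[str], float],      # LLM call: partial solution -> promise score
    problem: str,                          # the problem statement, e.g. the Game of 24 input
    depth: int = 3,                        # number of intermediate thought steps
    breadth: int = 5,                      # how many states to keep at each level
) -> str:
    # Start the search from the bare problem statement.
    frontier = [problem]
    for _ in range(depth):
        # Expand: for every kept state, ask the LLM to propose next thoughts.
        candidates = [
            state + "\n" + thought
            for state in frontier
            for thought in propose(state)
        ]
        if not candidates:
            break
        # Evaluate: ask the LLM how promising each partial solution looks,
        # then keep only the `breadth` highest-scoring states.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:breadth]
    # Return the highest-ranked chain of thoughts found.
    return frontier[0]
```

In the paper’s Game of 24 setup, each thought is one arithmetic step, so three steps reduce the four input numbers to a single value, and the best-performing configuration keeps the b = 5 most promising states at each level.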
In 100 games of Game of 24, the ToT method had a success rate of 74%, while the other methods succeeded less than 10% of the time. In an open-ended creative writing task, where the LLM is given four randomly generated sentences and must write four coherent paragraphs that end with those sentences, respectively, the passages generated with ToT were judged more coherent than those from CoT by both GPT-4 and human evaluators. Finally, when it came to solving mini crosswords, the ToT method correctly solved 20% of the provided crosswords, while IO and CoT succeeded less than 2% of the time. The IO and CoT methods got some of the letters and words in the crosswords right, but overall the ToT method was 2–3 times more accurate.
The results in this paper show that ToT is a viable LLM-based problem-solving method for specific types of tasks, albeit one that needs the human touch of hand-crafted, task-specific prompts for proposing and evaluating thoughts. ToT might not be necessary for many tasks that LLMs are already really good at, like writing emails or summarizing a blog post. It also requires many more LLM calls than the other methods, which can get expensive. But what I like about the ToT approach is that it excels on problems where the best answer isn’t clear at the outset, and the problem space must be explored in a non-trivial way to succeed. One area where this kind of open-ended problem solving could be useful is robotics, where the environment might not be well understood. A ToT-like environment-exploration algorithm could help robots accomplish their objectives.