**Summary by Adrian Wilkins-Caruana and Tyler Neylon**

This paper introduces LLEMMA, a large language model (LLM) with a twist: It's a math whiz. It can understand and solve complex mathematical problems. Like a real mathematician, it knows how and when to use tools (such as writing computer programs) to more efficiently solve particular problems.

Training LLMs to specialize in mathematics isn’t new. In fact, some models, like Minerva, are already quite good at solving math problems. Unfortunately, the best such models aren’t available to the general public, and neither is the data that was used to train them. But not only is LLEMMA just as good as the best existing math models, both the LLEMMA model *and* its training data are open source. LLEMMA also uses its model parameters more efficiently than Minerva, as you can see in the figure below, which compares model size against the ability to solve problems from a standardized question set called the MATH benchmark.

LLEMMA is a fine-tuned version of Code LLaMA, a coding LLM that’s a fine-tuned version of LLaMA 2, a general-purpose LLM. The authors fine-tuned LLEMMA on three kinds of math data: They combined existing math datasets of scientific papers (arXiv) and web data teeming with mathematical content (OpenWebMath) with a new open dataset called AlgebraicStack. This new dataset contains code examples for numerical simulations, computer algebra systems, formal theorem provers, and several general-purpose programming languages, such as Python and C++. The authors found this new data in public locations such as github and stackoverflow. Together, these three datasets give LLEMMA a wide array of knowledge useful for mathematical problem-solving. And unlike Minerva’s training data, all three datasets are open source.

How do we know that LLEMMA isn’t just some math machine that follows a mechanistic problem-solving algorithm? Let’s look at a level 5 problem from the MATH benchmark, which is the highest difficulty level. We can see LLEMMA follows some standard procedures, such as decomposing complex fractions into the sum of two smaller fractions, and replacing converging sums with their limits. But, it also does two non-trivial things: It recognizes that swapping the summation (as shown in red) simplifies the problem, and it recognizes that the resulting sum telescopes (i.e., the terms highlighted in blue cancel out).

(In case you’re curious how an LLM works with math notation: I took a look at the training data, and it contains a lot of LaTeX notation for papers, such as the publicly available source for many papers on arXiv. It seems likely that the model simply learns LaTeX based on seeing it in the training data.)

LLEMMA is not quite the state-of-the-art in mathematical problem solving. While one variant of the model, LLEMMA 34B (with ~34 billion weights) performs roughly on par with the larger Minerva 62B on the MATH benchmark (43.1% vs. 43.4%), the much larger Minerva 540B manages to win out (50.3%). At the same time, it’s possible that a similarly sized version of LLEMMA might do better if we extrapolate from the existing 7B and 34B versions. You might wonder: Why is LLEMMA more efficient in its use of network weights? The authors guess that two things help: First, they are training a model that’s not only a pre-trained LLM, but that’s also trained to code. Second, they put together a large, high-quality dataset that may be better for training than what Minerva used.

LLEMMA can also generate formal proofs of theorems, which are proofs specified in a programming language that can be verified by a computer. The example below shows how LLEMMA can generate the following:

an informal, human-readable proof,

a formal proof for the problem in the Isabelle language, and

a formal proof in the Lean 4 language.

For informal-to-formal proofs, LLEMMA is about as good as another piece of software called Sledgehammer, the Isabelle language’s built-in proof automation tool. For formal-to-formal proofs, it’s about as good as ReProver, a model specifically trained for this task.

Finally, while it may not seem significant at first, making LLEMMA open is a pretty big deal. Even though other researchers have reported better models, LLEMMA is quite strong. Just as LLMs like GPT-4 can answer many general questions, specialized models like LLEMMA can help math students and mathematicians alike, letting them check their work or providing them hints when they’re stuck.

This is great! I often hear "LLMs will never be able to do math". This shows otherwise 😎

Can it prove that addition in the set of natural numbers (with usual succession) is commutative? If so, how long does it take and how many steps are in the proof?