Paper: LIMA: Less Is More for Alignment
Summary by Adrian Wilkins-Caruana
Have you ever wondered how GPT-4 was trained? In a recent keynote, OpenAI cofounder and AI extraordinaire Andrej Karpathy divulged many of the details. The first, most substantial, and most expensive stage of training GPT-4 and models like it is the pre-training phase, where the model is trained to predict the next token on an internet-scale dataset. Karpathy then described three additional, more sophisticated training phases:
Supervised fine-tuning
Reward modeling
Reinforcement learning
He explained that supervised fine-tuning is a reliable way to make the model better at particular tasks, while the reward modeling and RL stages are very unstable and difficult to get right. Now, you may be thinking, “Ah, that must be GPT-4’s secret sauce.” Well, today’s paper argues that these last two stages really aren’t so critical. The authors assert that, so long as you use high-quality examples, supervised fine-tuning gets you most of the way to GPT-4.
The crux of the paper is something the authors call the Superficial Alignment Hypothesis:
A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.
The hypothesis essentially means that those other, more sophisticated training stages don’t really “teach” GPT-4–scale LLMs anything new; they just help the models learn how to speak and behave. If this is true, then it should be possible to turn a pre-trained model into one that’s competitive with GPT-4 using fine-tuning alone, as long as you have high-quality training examples.
In short: Train smart, not hard.
To test this hypothesis, the researchers used Meta’s 65B-parameter LLaMA model, which is only pre-trained: it hasn’t been fine-tuned or further refined with reward modeling or RL. If their claim is valid, then we’d expect a carefully fine-tuned LLaMA to perform comparably to GPT-4. So they set out to create a fine-tuned LLaMA model, which they named LIMA (Less Is More for Alignment).
To do this, the researchers focused on compiling a high-quality dataset. They chose to refine LLaMA for question answering, so they gathered 1,000 fine-tuning examples from community Q&A websites: Stack Exchange, wikiHow, and Reddit’s r/AskReddit and r/WritingPrompts (subreddits that typically contain higher-quality material than others). The researchers also manually wrote some Q&A prompts and answers of their own for fine-tuning and testing, taking care to keep the answers consistent in tone and style, a hallmark of high-quality fine-tuning data.
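To make the setup concrete, here’s a minimal sketch of what supervised fine-tuning on a small, curated set of Q&A pairs could look like with Hugging Face Transformers. The checkpoint name, prompt template, and hyperparameters are my own illustrative assumptions, not the paper’s exact recipe.

```python
# A rough sketch of supervised fine-tuning on a small, curated Q&A dataset.
# Checkpoint id, prompt format, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "huggyllama/llama-65b"  # assumed checkpoint id; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Each example is a (question, answer) pair drawn from curated Q&A sources.
examples = [
    {"question": "How do I poach an egg?",
     "answer": "Bring a pot of water to a gentle simmer, add a splash of vinegar..."},
    # ...the remaining ~999 curated examples
]

def to_features(ex):
    # Concatenate prompt and response into one sequence; training then uses the
    # ordinary next-token-prediction objective over that sequence.
    text = f"Question: {ex['question']}\nAnswer: {ex['answer']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=2048)

dataset = Dataset.from_list(examples).map(to_features, remove_columns=["question", "answer"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lima-style-sft",
        num_train_epochs=15,              # many passes over a tiny dataset; illustrative value
        per_device_train_batch_size=1,
        learning_rate=1e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the sketch is how little machinery is involved: no reward model, no RL loop, just next-token prediction on a small but carefully chosen set of examples.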
After training LIMA with a standard fine-tuning setup, the researchers evaluated the quality of its generative output by asking crowdsourced workers which model’s response they preferred. The researchers also used GPT-4 to annotate which model’s output it preferred. They compared LIMA to Alpaca 65B (another fine-tuned LLaMA model), Google’s Bard, DaVinci003 (GPT-3 tuned with RL from human feedback), Claude (an LLM from Anthropic), and GPT-4.
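As a rough illustration of this kind of pairwise evaluation, here’s a short sketch that tallies preference judgments into win/tie/loss rates per baseline. The record format and label values are assumptions for illustration, not the paper’s actual annotation schema.

```python
from collections import Counter

# Each annotation records which response the judge preferred for one test prompt.
# Field names and label values are assumed for illustration.
annotations = [
    {"baseline": "Alpaca 65B", "preferred": "LIMA"},
    {"baseline": "Alpaca 65B", "preferred": "tie"},
    {"baseline": "GPT-4", "preferred": "baseline"},
    # ...one record per (prompt, baseline) pair
]

def win_rates(annotations):
    # Tally LIMA's wins, ties, and losses against each baseline model.
    tallies = {}
    for record in annotations:
        counts = tallies.setdefault(record["baseline"], Counter())
        counts[record["preferred"]] += 1
    for baseline, counts in tallies.items():
        total = sum(counts.values())
        print(f"vs {baseline}: "
              f"LIMA wins {counts['LIMA'] / total:.0%}, "
              f"ties {counts['tie'] / total:.0%}, "
              f"loses {counts['baseline'] / total:.0%}")

win_rates(annotations)
```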
The figures below show the results from the human (left) and GPT-4 (right) annotations. Both sets of annotations showed similar trends: Over the test set, LIMA was roughly on par with Bard, was preferred over Alpaca and DaVinci003, but generally lost to Claude and GPT-4. GPT-4’s preference annotations were more consistent than the human annotations from model to model, but they generally agreed with the judgments of both the crowd workers and the authors themselves.
I think these results are really quite impressive. Remember, LIMA is fine-tuned with only 1,000 Q&A examples! LIMA performed better than DaVinci003 and Alpaca, and on par with Bard, so the Superficial Alignment Hypothesis certainly does hold to some degree. That being said, the approach only goes so far: at some point, the far larger training effort behind models like Claude and GPT-4 still wins out. There’s more so-so news, too: While LIMA held its ground on prompts similar to those it saw during training, it was thrown off by out-of-distribution and adversarial prompts.
While I think there will always be a place for very large, general-purpose foundation models like ChatGPT, GPT-4, and Bard, I think that more specialized models will really be the unsung AI heroes. These models may not have to be so large, since they would only need to accomplish very specific tasks. A LIMA-based fine-tuning strategy could prove to be an efficient and cost-effective way to develop these specialized models, potentially unlocking a myriad of other LLM applications we haven’t even thought of yet.