Summary by Adrian Wilkins-Caruana
ChatGPT is great, but do you know how it works? For most people, the answer is “no.” Even if you’re up to speed on the latest research in LLMs and have read all of OpenAI’s technical documentation and research papers, you still can’t know for certain exactly how ChatGPT works — they simply haven’t shared the details. This means that ChatGPT isn’t an “open” model.
Openness is a cornerstone of research: it lets the research community critique, find flaws in, and ultimately improve the functionality and reliability of others' work. It's particularly significant in LLM research, where the biases and potential risks of individual LLMs can be subtle and hard to fix. Some LLMs are more open than ChatGPT, such as Meta's Llama models, whose source code and weights are made available, but they're still not fully open, since the code and data used to train them aren't. On the other hand, LLMs that are truly open aren't really in the same class as Llama, since they're much smaller in scale. This is where the Open Language Model, OLMo, comes in: it's a truly open LLM from the Allen Institute for Artificial Intelligence.
OLMo is a decoder-only, transformer-based LLM, just like Llama and other similar LLMs. There are three OLMo variants: 1B, 7B, and 65B. The 1B and 7B models have already been released; the researchers are still training the 65B model and plan to release it soon. As we'll discuss below, OLMo is more than just the weights that comprise the model: it's also code, data, logs, and (thanks to its openness) an invaluable asset for AI researchers. Here are some stats about the OLMo-7B model compared to other models with publicly shared architectures:
There aren’t any hard-and-fast rules about what constitutes an open LLM, but I can think of three important qualities you might want. The first is that all resources used throughout the LLM’s development should be released. This means that even though Llama’s weights are released, it’s still not really that “open.” In contrast, the Allen Institute researchers are releasing:
The inference code and weights, so people can run OLMo,
The training code and data, so people can retrain OLMo,
The training logs and metrics, so people can compare their training attempts with the researchers’,
The evaluation code, so people can fairly compare other models to OLMo, and
The “adaptation” code, a cherry on top: further training code for fine-tuning OLMo into an instruction-following LLM like ChatGPT.
[If you’re curious, links for all the above are at the end of this article.]
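Because the inference code and weights are released, running OLMo locally should take only a few lines. Here's a minimal sketch using the Hugging Face `transformers` library; the model id `allenai/OLMo-7B` and the need for `trust_remote_code` are assumptions, so check the official OLMo release for the exact identifiers:

```python
# Sketch: generating text with OLMo via Hugging Face transformers.
# The model id "allenai/OLMo-7B" is an assumption -- consult the
# official OLMo release for the exact Hub identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate(prompt: str, model_id: str = "allenai/OLMo-7B",
             max_new_tokens: int = 50) -> str:
    """Load OLMo and continue `prompt` (downloads the weights on first call)."""
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The same `from_pretrained` pattern would apply to the 1B variant, which is a more practical starting point on consumer hardware.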
The second quality that makes an LLM open is its license. Even when code is “released,” the license the author chooses can limit how people may use it. Famously, the original Llama model was released under a license that forbade commercial use; in an attempt to be more “open,” Meta released Llama 2 under a license that permits such use. The OLMo researchers are releasing their code under the Apache 2.0 license, which lets you use, modify, and distribute OLMo, including as part of derivative works. Importantly, the license doesn't require those derivative works to be licensed under the same terms, so, for example, a company can modify OLMo, keep its changes private, and use the result commercially.
Open code and an open license make OLMo open and accessible, but these two qualities alone don't make it useful. That's where the final quality comes in: language-modeling accuracy. OLMo's accuracy is on par with that of some of the best open-weight LLMs available, and training an LLM to this level requires substantial computing resources that most researchers don't have access to. In evaluations spanning eight different tasks, OLMo-7B keeps up with the best weight-accessible LLMs around, such as Falcon, Llama, and Llama 2. It's a similar story for the instruction-tuned OLMo on instruction-based tasks, where it performs roughly on par with Llama 2-Chat (better on some tasks, worse on others).
When Meta released the weights for the first Llama model a couple of years ago, it turbocharged LLM research. Suddenly, researchers who didn’t have the resources to train LLMs from scratch were no longer restricted to LLM-based research via limited and expensive APIs. Instead, they could fine-tune the weights and integrate the model into larger systems — for free. I predict that OLMo will ignite a similar fire in researchers, as it will open up new areas of investigation, like identifying the impact of small amounts of harmful training data or determining the best training regimens for LLMs.
OLMo links:
🏋️ weights
🧑‍💻 code
🔢 data
📈 eval
🧑‍🏫 fine-tune (aka “adaptation” or “open-instruct”)
🪵 logs (aka “training metrics”)