[The original DeepMind article]

The International Mathematical Olympiad (IMO) is an annual competition where rising stars in mathematics represent their countries in a test of their abilities. One of the contestants in this year’s IMO was a peculiar stateless entrant, backed by the folks at Google DeepMind. The entry consisted of the joint efforts of two AI systems: AlphaProof and AlphaGeometry 2. The system scored a total of 28 out of 42 points, or, put another way, it got perfect scores on the four (out of six) problems it managed to solve, and none for the others. It placed 58th out of 609 contestants, which this year equated to a silver medal.

First, some groundwork: Regular human contestants have just two 4.5-hour blocks to submit solutions, but Google DeepMind’s systems took longer than that. AlphaGeometry 2 solved one problem within minutes, but AlphaProof took *three days* to solve two algebra problems and one number theory problem. The systems weren’t able to solve the other two problems, which covered combinatorics. So, this AI participant wasn’t *really* a contestant, but I don’t think that takes away from Google DeepMind’s achievement.

As its name suggests, AlphaProof proves mathematical statements, and it does this using a combination of two things. The first is a software tool called *Lean*, which verifies whether a proof of a mathematical statement is correct. Proving something in Lean is kind of like finding a path from your house to the grocery store. The proof itself is a series of *moves* or *proof steps*, like “turn left” and “stay on Biscayne.” Lean checks each step to make sure it’s valid, which is important because, depending on the state of the problem, not all moves are valid. The second part of AlphaProof is AlphaZero, the same game-playing AI that Google DeepMind previously used to master chess, shogi, and Go. Within AlphaProof, AlphaZero works much as it did for those games: by searching over possible moves and evaluating which paths show promise.
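To make the “moves” idea concrete, here’s a tiny Lean 4 example of my own (not from DeepMind’s system), where each tactic line is one step that Lean checks:

```lean
-- Each tactic is a "move"; Lean verifies every step before accepting the proof.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]  -- rewrite using commutativity; the goal then closes by reflexivity
```

If a step were invalid, say, rewriting with a lemma that doesn’t match the goal, Lean would reject the proof at that exact line.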

But doing mathematical proofs is quite different from playing chess or Go. The researchers needed to train the system on the types of problems that AlphaProof would encounter in the IMO. So they fine-tuned a Gemini LLM to translate natural language problem statements into formal ones. Then, using a dataset of ~100k formal problem statements, the researchers trained AlphaProof to prove or disprove them by searching over possible proof steps (i.e., *moves*) using Lean. When Lean says that the series of steps satisfies the original problem statement, it’s done! Each proof that AlphaProof finds reinforces it, enhancing its ability to solve problems. The figure below shows this process.

AlphaGeometry 2 takes a slightly different approach. The Google DeepMind researchers describe it as a “neuro-symbolic” hybrid system since it uses a language model (based on Gemini) and a symbolic engine. This approach starts by representing the geometric problem as a graph (or network) of geometric symbols (e.g., a point, line, angle, circle, segment, etc.). Then, a language model suggests some next steps (e.g., construct D: midpoint BC). Next, using some primitive geometric relationships like the behavior of parallel lines or midpoints, the symbolic engine searches over possible next steps (kind of like AlphaZero) to see if it can reach the goal of the proof. For instance, in the example below, the language model suggests a construction that helps prove that the triangle is isosceles where, in the right frame, the first blue statement represents the construction, and the subsequent ones represent the subsequent symbolic deductions.

TWIST! What I described just above was actually AlphaGeometry (the original, which was announced in January 2024), not its successor. AlphaGeometry 2 includes three main enhancements that help it solve much more challenging problems:

- It’s trained on an order of magnitude more synthetic data.
- Its symbolic engine is two orders of magnitude faster, allowing it to expand its search for solutions.
- It uses a novel *knowledge sharing* mechanism, which lets the deduction engine share information from disparate steps/constructions and their subsequent deductions, and — if helpful — combine these disparate steps.

When I was studying math in school, I often found myself using WolframAlpha to check my answers or my work. I see Google DeepMind’s system as an extension of this idea, where mathematicians can use tools like this to help them solve and verify solutions to problems. To some extent, this is already happening!

This year, Australian mathematician Terence Tao presented a talk on “machine assisted proofs,” where he said that he suspects it’s ~20x harder to write formal proofs (ones computers can verify) than informal ones (ones that mathematicians write, publish, and peer-review). However, he also said that AI integration could change this, potentially tipping the balance in favor of formal proofs, which would have a dramatic impact on the field of mathematics. If we extrapolate the rate of progress made by AlphaGeometry over this year alone, it seems that the balance is definitely shifting, and potentially quite rapidly.

When I hear people say “AI is going to take all our jobs,” what I think they mean is that LLMs like ChatGPT will automate more and more tasks to the point where many tasks don’t require a human anymore. There’s a bit of hand-waving involved in that inference, but I think it’s pretty fair. But LLMs are just text generators, and most jobs involve more than just pressing keys on a keyboard. So, how can we use LLMs to, say, automate a job that involves manipulating spreadsheets? How would it read the spreadsheet, let alone use it to answer questions about the data? Today’s summary is about a new method that lets an LLM do exactly that.

Let’s pretend that *we* are AI engineers and our job is to make an LLM manipulate a spreadsheet. How might we do this? One way might be to describe each cell using text. Let’s use this approach to describe a very important spreadsheet of mine:

We can describe this spreadsheet like this:

*The following text describes a spreadsheet for tracking foods and their ratings. Here’s the data:*

*Cell A1. Text: “Food”. Formula: None. Formatting: Bold, and centered.*
*Cell A2. Text: “Ice cream”. Formula: None. Formatting: None.*
*Cell A3. Text: “Milk & cookies”. Formula: None. Formatting: None.*
…

We could then ask the LLM to do some work for us, like “Please tell me how to calculate the average rating,” and it might say:

`Cell B6. Text: None. Formula: "=Average(C2:C5)". Formatting: "Numeric, two decimal places"`

There are, however, two main issues with this naive approach and others like it: It’s unnecessarily verbose, and its index-first structure isn’t ideal for an LLM. The next bit of this summary explores some clever techniques — developed by Microsoft researchers — to fix these problems.

When you make a spreadsheet, do you use whitespace/empty cells to delineate particular tables or separate different kinds of info? So do I! But it turns out that this whitespace is really unhelpful for an LLM, since it adds a lot of useless, distracting information to a text-encoded spreadsheet. So the researchers came up with a technique called *structural anchors,* a heuristic-based algorithm that essentially draws boxes around useful information in a spreadsheet. The method then extracts the cells inside these anchors (and a little bit outside the anchors, just in case the structure isn’t perfect), and remaps the addresses so that they make sense without the whitespace.
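Here’s a heavily simplified sketch of the address-remapping part of that idea (the real method detects table boundaries heuristically and keeps a small margin; this version just drops empty rows and columns):

```python
# Drop empty rows/columns and re-address the remaining cells densely,
# so the encoding contains no whitespace-only structure.
def compact(cells):
    """cells: dict mapping (row, col) -> value; gaps represent empty cells."""
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    row_map = {r: i for i, r in enumerate(rows)}
    col_map = {c: i for i, c in enumerate(cols)}
    return {(row_map[r], col_map[c]): v for (r, c), v in cells.items()}

# A sparse sheet where rows 1-4 and column 1 are entirely empty.
sparse = {(0, 0): "Food", (0, 2): "Rating", (5, 0): "Ice cream", (5, 2): 9}
print(compact(sparse))
```

The remapped addresses “make sense without the whitespace” because every empty row and column between the populated cells simply disappears.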

Continuing on the theme of “things about spreadsheets that humans like but LLMs don’t” are the 2d matrix format, our repetition of some values (e.g., “Dessert” in my spreadsheet), and that we sometimes scatter useful bits of info at seemingly random places. The researchers found that LLMs, being the language-lovers that they are, much prefer a dictionary-like format. So, the researchers created an inverse index–based translation method that flips a spreadsheet on its head. It uses the *values* of the cells as the primary keys, not the cell indexes. The dictionary values are lists of cell indexes, so the encoding can easily represent repeated values. My spreadsheet above might look like this:

```
{
  "Food": ["A1"],
  "Category": ["B1"],
  "Rating": ["C1"],
  "Ice cream": ["A2"],
  "Dessert": ["B2", "B3"],
  "9": ["C2"],
  ...
}
```
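A minimal sketch of building that inverse index (the address-to-value input representation is my own; the paper’s encoding carries more detail):

```python
# Inverse index: cell values become keys; repeated values share one entry.
def invert(cells):
    """cells: dict mapping address -> value, e.g. {"B2": "Dessert"}."""
    index = {}
    for address, value in cells.items():
        index.setdefault(str(value), []).append(address)
    return index

cells = {"A1": "Food", "B2": "Dessert", "B3": "Dessert", "C2": 9}
print(invert(cells))
```

Note how “Dessert” appears once as a key with two addresses, so repetition costs almost nothing in tokens.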

The researchers realized one more way they could better represent the spreadsheet for an LLM. To understand this trick, keep in mind that spreadsheet-aware LLMs don’t actually have to do any computing on their own, because they can produce commands that are executed by the spreadsheet software. In other words, they simply ask the spreadsheet to perform the calculations, just as a human would. Because of this, the LLM doesn’t need to know the specific numeric values of the cells — it just needs to know what format they are (for example, an integer). So the encoding can represent numeric cells — like integers, floating point numbers, percentages, dates, etc. — using some text that describes their format. In my example above, the encoding represents the Rating values in the dictionary format like this: `"IntNum: C2:C5"`.
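A toy sketch of that aggregation for a single column (the `IntNum` token follows the example above; real spreadsheets would need fancier range detection than this contiguous-run check):

```python
# Replace a contiguous run of integer cells in one column with a single
# "IntNum: <start>:<end>" type descriptor.
def aggregate_int_column(col, rows_to_values):
    """rows_to_values: dict mapping row number -> cell value."""
    rows = sorted(rows_to_values)
    if rows and rows == list(range(rows[0], rows[-1] + 1)) \
            and all(isinstance(v, int) for v in rows_to_values.values()):
        return f"IntNum: {col}{rows[0]}:{col}{rows[-1]}"
    return None  # not a contiguous integer run; keep the cells as-is

print(aggregate_int_column("C", {2: 9, 3: 8, 4: 10, 5: 7}))
```

Four numeric cells collapse into one short descriptor, which is where the token savings come from.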

Overall, these three tricks reduce the number of tokens needed to represent the spreadsheet by 25x compared to one of the naive encoding methods the researchers considered. They tested their approach on a Spreadsheet QA task, and found that regardless of which base LLM they used (e.g., Llama 3, Mistral 3, GPT-4, etc.), LLMs that use these techniques equaled or outperformed the existing spreadsheet-analyzing technique, called TableSense-CNN. The GPT-4 model had the best F1 score on this benchmark, on average scoring 9 percentage points higher (76%) than TableSense-CNN (67%).

The researchers also conducted ablation experiments, individually excluding one of their three techniques (anchoring, inverse-index/dictionary encoding, and data type aggregation). While the first two techniques tended to improve the F1 scores with GPT-4, surprisingly, the last technique actually made the F1 scores slightly *worse;* the best score achieved on their benchmark was 79%, using GPT-4 without aggregation. The researchers hypothesize that this might be because the data types are a bit too abstract for the LLM. Nonetheless, they suggest that the aggregation could be necessary for some models that have a limited context length, since the aggregation reduces the number of tokens the model needs to represent the spreadsheet significantly.

I think it’s remarkable that an LLM can work with spreadsheets so well considering that spreadsheets are fundamentally designed for humans, not computers. It seems to me that a spreadsheet is really an unsuitable tool for an LLM to use for solving problems. Still, the techniques presented in this paper could be really helpful to humans when we use spreadsheet software. For example, we could use it to ask questions like “How can I forecast next month’s expenses?” or “When will we break even?” Given that this paper comes from Microsoft researchers, I’m sure that such features are coming to Excel soon!

In a 2007 essay called *The Origin of Circuits*, Alan Bellows tells a fascinating story about an experiment that Dr. Adrian Thompson conducted in the 1990s. Thompson took a 10 x 10 array of logic cells on an FPGA (a reconfigurable chip) and tried to see if he could evolve a program encoded by these gates to reliably distinguish between signals of two different audio frequencies. For sophisticated hardware of the time (like dedicated signal processors), this was a trivial task, but Thompson found that even this tiny array could evolve logic configurations that reliably detected the signals.

Aside from this main result, there are two other remarkable things about Thompson’s experiment. First, the logic configuration was updated iteratively in an evolutionary manner that’s quite similar to evolutionary machine learning algorithms. The other is that, upon investigating the most successful configuration, Thompson noticed a section of the array that was logically disconnected from the array’s output yet, without it, the array couldn’t reliably classify the signals. This means that the disconnected logic section was influencing the classification through some mechanism other than digital logic, and that the evolutionary algorithm seemed to account for the effects of this mechanism as it updated the program. (It turns out some of the logically-disconnected gates were influencing the voltage of other nearby gates via magnetic flux.) I highly recommend giving Bellows’s essay a read if you haven’t before.

With three decades of hindsight, we can see that Thompson’s array of logic gates was an example of a *physical neural network*, or PNN. PNNs are neural-like networks that aren’t built from silicon chips (though they could be) but instead from components that harness other physical phenomena, like light or sound. In a sense, PNNs offer an alternative paradigm of machine learning, one which isn’t necessarily constrained by the limitations of digital logic. That is to say, PNNs can let us harness various physical phenomena to solve problems with machine learning. Today’s summary explores training PNNs, i.e., the different ways that PNNs’ parameters can be updated to solve particular problems.

Before exploring how PNNs are trained, let me quickly describe the concept of *backpropagation* (BP), the workhorse of traditional, digital neural network training. When an NN makes a prediction (sometimes called a *forward pass*) and we know what that prediction should have been, we can calculate the error in the network’s prediction. We can then propagate the error backwards through the network (sometimes called a *backward pass*), updating the network’s parameters so that it’s more likely to give a more accurate answer the next time it runs on similar inputs.
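Here’s the whole idea in miniature, using a one-parameter “network” (all values are toy numbers):

```python
# Minimal numeric illustration of backprop: prediction y = w * x,
# squared-error loss, gradient descent on w.
w, x, target, lr = 0.5, 2.0, 3.0, 0.1
for _ in range(20):
    y = w * x                      # forward pass
    grad_w = 2 * (y - target) * x  # backward pass: dLoss/dw
    w -= lr * grad_w               # update step
print(w, (w * x - target) ** 2)
```

Each iteration runs a forward pass, computes the gradient of the loss with respect to `w`, and nudges `w` downhill; after a few steps `w` approaches 1.5, the value that makes the prediction match the target.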

One way of training a PNN, called *in silico training*, mirrors BP quite closely. In silico training involves digitally simulating and optimizing physical parameters (θ) using a digital twin, which is an emulation of the physical hardware within a computer environment. Similar to BP in traditional neural networks, in silico training uses these digital models to compute gradients and update weights, which you can then apply to the physical system through some other process. This approach benefits from the rapid, cost-effective iteration and testing of PNN architectures, but it might not work well if the digital twin isn’t perfect. This means that the entire process is simulated, and the physical model is only given the learned weights at the end.

Another training approach, called *Physics-aware BP training* (PAT), is a hybrid of in situ (meaning not simulated) and in silico methods. In PAT, the physical system handles the forward pass, while the backward pass is performed by differentiating a digital model that is an approximation of the physical system. This means the info for the forward pass will still be precise while maintaining the versatility of performing the backward pass on a computer. You still need an accurate digital twin to effectively model the backward pass, and the larger and more complicated the PNN is, the harder it is to make an accurate model of it.

Both of the methods described above are, in a sense, cheating, since they aim to train PNNs using conventional digital NN techniques. But there’s good reason for this, since BP has been shown to be much more effective than other techniques for training digital NNs. But, as we’ve seen with in silico training and PAT, it can be hard to accurately back propagate error signals through a PNN. Are there any other ways? Here are two:

**Feedback alignment (FA)** is an alternative to BP whereby some of the terms in BP’s weight-update rule are replaced by fixed random matrices. This means that, unlike BP, we don’t need to know exactly what the weights were in the forward pass to know how to update them. Surprisingly, the network still learns, because the forward weights tend to adapt to align with the random feedback.

**Physical local learning** uses a concept called *local learning* to train the weights in each block or layer independently (i.e., without any BP). There are lots of different ways that local learning could be achieved, but they typically all try to define some objective function using the layer’s activations, one that indicates whether the activations are doing something useful, like compression or providing useful information for the next layer/block. Geoffrey Hinton’s forward-forward technique is one example of local learning, and one study has already used it to train an optical NN with a contrastive-based approach.
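As a toy illustration of feedback alignment, here’s a tiny two-layer linear network (all numbers made up) where the backward pass for the first layer uses a fixed random vector instead of the output weights:

```python
# Feedback alignment sketch: the backward pass uses a fixed random vector B
# in place of the transpose of the forward weights W2.
import random
random.seed(0)

x = [1.0, 0.5]                     # single toy training input
t = 1.0                            # toy target
W1 = [[0.1, 0.0], [0.0, 0.1]]      # hidden-layer weights
W2 = [0.1, 0.1]                    # output-layer weights
B = [random.uniform(-1, 1), random.uniform(-1, 1)]  # fixed random feedback
lr = 0.05

def forward():
    h = [sum(W1[i][j] * x[j] for j in range(2)) for i in range(2)]
    y = sum(W2[i] * h[i] for i in range(2))
    return h, y

_, y = forward()
loss_start = (y - t) ** 2
for _ in range(200):
    h, y = forward()
    e = y - t
    for i in range(2):             # exact gradient for the output weights
        W2[i] -= lr * e * h[i]
    for i in range(2):             # FA step: B replaces W2 in the backward pass
        for j in range(2):
            W1[i][j] -= lr * e * B[i] * x[j]
_, y = forward()
loss_end = (y - t) ** 2
print(loss_start, loss_end)
```

Even though `B` has nothing to do with `W2`, the loss still drops on this toy problem, which is the surprising core of FA.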

These methods aren’t quite as effective as BP, though, so other studies try to reproduce BP without the need for a digital twin; they essentially encode the BP algorithm directly into the physical system. For this to work, the system needs to utilize a physical process that is linear and reciprocal, i.e., one where signals propagate through the system the same way in both directions. Two examples are waves propagating through a linear medium in a photonic system, and a peculiar electrical device called a memristor crossbar array. There are also other techniques for in situ training, such as one called *continual learning* that updates the model’s parameters as it’s used.

At the end of their review, the authors arrived at three qualities that would be great for a PNN to have, although nothing meets all three (yet):

- They don’t depend on the model used.
- They give a speed or efficiency advantage over regular NNs.
- They are more resilient to noise.

But a PNN doesn’t need to have all three of these qualities to be useful. It just means a bit more effort might need to go into developing them, since we can’t yet say for certain things like, “Oh, this particular learning algorithm works best for this kind of PNN or this kind of application.” Given the current pace of AI developments, the possibility of realizing all three at once could be on the horizon, which could open the doors to an entirely new domain of AI.

Last week we discussed a new method for resolving the 3d structure of a scene from two perspectives or photos of it. I mentioned how our brains can do this too, using the images from each of our eyes to perceive things in 3d. But what if you close one eye; do you lose your ability to see in 3d? The answer is a resounding “No!” Our brains can still perceive a lot of 3d info from a single, unmoving image. In fact, we do this all the time when we look at photos and use things like occlusions and shadows to infer depth and scale. This is why we find optical illusions like the Penrose Triangle perplexing.

If our brains can perceive the depth of objects using information in an image, then computers probably can, too. This process, known as monocular depth estimation (MDE), has been an active area of machine learning research for some time now. Before discussing that research, let’s learn a bit more about MDE. The figure below shows relative depth estimates on some images predicted by a model called Depth Anything V2. The redness/blueness of the results indicate parts of the image that are closest/farthest from the camera, respectively.

Depth Anything V2 is an impressive model. But the story of its success isn’t just “more data” or “bigger model.” To appreciate it, we first need to review how we got here:

- MiDaS is an impressive MDE method that broke onto the scene in 2020. It predicts the relative depth of pixels in an image, and was trained using supervised learning on a dataset of over a million images with depth labels. When it was released, it was the state of the art for MDE.
- The MiDaS team made incremental improvements, and its third iteration was the state of the art until this year. Despite its success, it struggles to predict depth on images that are different from its training data (that is, zero-shot prediction).
- In January, the researchers behind Depth Anything V2 released Depth Anything (V1). While this model introduced several innovations to help zero-shot generalization (like augmentation and a special loss term), its real innovation — and the one that ultimately helped improve its zero-shot generalization — was its use of *unlabeled* training images. (I’ll explain how this is possible shortly!)
- Finally, we now have Depth Anything V2, a short six months after V1. In terms of data, V2 took a drastic approach, ditching labeled data entirely for synthetic data! As we’ll see, it improves upon V1 in a number of ways, including fine-grained details, accuracy, and its ability to not be fooled by confusing surfaces like windows and mirrors.

From this MDE timeline, it’s clear that training data has played a pivotal role in MDE iteration. If you think about this for a moment, it kind of makes sense. The depth “labels” — which are generated from a number of depth-sensing sources like RGBD cameras (yep, the “D” is for “depth”) or LIDAR — can have a lot of issues, like not being as high-resolution as their corresponding image, or they might be noisy or just not very accurate. So MDE models trained on this data will be fundamentally limited by it.

In Depth Anything V1, the researchers came up with a way to use unlabeled images (i.e., regular images) as training data. To do this, they first trained the best model they could on the labeled dataset, then used this model to predict the depth on unlabeled images, and used these predictions as *pseudo labels*. This is called a student-teacher approach, where the big model trained on the labeled data is the teacher, and its knowledge (i.e., predictions) are used to teach the student model, which is nice to have because it’s much smaller (and so more convenient to work with) than the teacher model. The reason I’m taking time to specifically mention this aspect of V1 (and not any of the other clever tricks they invented) is because it’s crucial for V2. The figure below shows a rough schematic of the student-teacher process, where the solid lines indicate the flow of labeled data/images, and dashed lines represent the flow for unlabeled ones. (The “semantic preservation” is one of the tricks I mentioned, and it prevents the encoder from varying too much from when it was trained on labeled data).
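The student-teacher idea can be sketched like this, with stand-in models (a fixed function as the “teacher” and a single scalar parameter as the “student”; the real models are, of course, large neural networks):

```python
# Pseudo-labeling sketch: the teacher labels unlabeled inputs, and the
# student trains on those predictions as if they were real labels.
def teacher(x):          # stand-in for the big model trained on labeled data
    return 0.5 * x

student_w = 0.0          # stand-in student: predicts student_w * x
unlabeled = [1.0, 2.0, 3.0, 4.0]
lr = 0.05
for _ in range(100):
    for x in unlabeled:
        pseudo_label = teacher(x)          # the teacher supplies the "label"
        error = student_w * x - pseudo_label
        student_w -= lr * 2 * error * x    # ordinary supervised update
print(student_w)
```

The student never sees a real label, only the teacher’s predictions, yet it converges to the teacher’s behavior on the unlabeled data.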

So, Depth Anything V1 used unlabeled data to address the zero-shot generalization problem of MiDaS. V2 goes further to improve the accuracy and robustness of the predictions by extensively using synthetic data, which you can think of as renderings of 3d scenes — like from a video game — where the depth can be measured exactly using the information in the 3d model. While it sounds insane, the choice to completely ditch labeled data actually makes a lot of sense. If the labeled data isn’t accurate enough, just replace it with highly accurate synthetic data, right? Well, there are two very good reasons *not* to do this: synthetic data is typically quite different from real imagery (a large domain shift), and it’s far less diverse. But combining synthetic data with both the student-teacher approach and unlabeled real images alleviated these two problems and yielded a model that’s both accurate and has good zero-shot generalization.

I think the best way to demonstrate why using only synthetic data is so helpful is to take another look at the pictures of the bridge and the room above. In the bridge image, there's a lot of fine-grained detail that a depth sensor might not be able to capture. And, in the room image, notice how the depth indicates the window, not the objects that you can see through the window. This is the kind of depth detail that’s really difficult to accurately capture in the real world.

“But Adrian!” I hear you screaming, “Surely there must be *some* value in all that labeled data.” Well, the Depth Anything V2 authors thought that too. So, when training the student with the pseudo labels on unlabeled images, they tried including a little bit of labeled data — the highest-quality labeled data they had. But they found that mixing in just 5% labeled data (keeping 95% synthetic) seemed to harm performance, particularly for fine-grained details. You might need to squint, but the figure below shows that the model that used synthetic data only (middle column) is definitely superior.

I really like the authors’ approach in the Depth Anything V2 paper, and I think their model’s results are quite impressive. It’s a very “outside-the-box” idea, and I’m curious to see if researchers in other domains can use similar ideas to make other AI breakthroughs. I highly recommend taking a look at the paper’s webpage to see more examples of what they accomplished.

I’ll leave you with one more Depth Anything V2 example below, one that yet again shows how impressive it is while also being a bit of a contradiction to the authors' claims of robustness. Specifically, they claim that the model is quite robust to domain shift (which it certainly is), but they use these depth-prediction results on drawings and paintings (shown below), among others, as an exemplar of this idea. But what *is* the correct result in this case? Should it be a realistic depth (as Depth Anything V2 predicts) or just the depth of a planar surface, since the paper or canvas is presumably flat? I think it should be flat, since this would be more in line with predicting windows instead of what’s shown through the window. What do you think?

Your brain is amazing. Whenever you open your eyes and look around, you experience your surroundings in 3d despite only having two 2d views of it, one from each eye. This 3d environment that you perceive is so good that, without much effort at all, you can accurately judge things like how hard and in what direction you need to throw a ball so that it gets to a specific person, or how far it is between your car and a red traffic light in the distance.

To give you an idea of why it’s amazing that your brain can do this, let’s quickly break down the process of *multi-view stereo* (MVS) reconstruction, which is computer-speak for “How do I turn the two flat images from these two cameras into a 3d model of what they saw?” First, a computer needs to identify and match the same parts of the images. Then, using information about where each image was taken and other details like the parameters of the camera’s lens, along with some complicated mathematics, it can reconstruct where in 3d-space each of those parts of the image must have been. Believe it or not, that’s actually an *oversimplification*. The figure below, taken from the paper of a popular MVS technique called COLMAP, shows a breakdown of an actual MVS pipeline.

The approach taken by COLMAP and other MVS techniques — that is, using mathematics and algorithms — seems perfectly sensible to me, and it works quite well. But is this what our brains are doing when they see the world? Maybe. Or maybe our brains operate more like a new machine learning-based approach called DUSt3R.

Like COLMAP, DUSt3R turns images into a 3d point cloud, but in a completely different way. Here’s how it works: First, the model extracts small patches from two images, and then separately encodes them using the same Vision Transformer (ViT) encoder. Then, two ViT decoders share information about these patches via cross-attention to generate one feature vector for each patch in each image. Finally, a “head” (a fully-connected layer) predicts the 3d positions {x, y, z} of each pixel in each image, as well as a confidence value that indicates how confident the network is about each pixel’s prediction. Importantly, the 3d positions predicted by the head for the second image are in *the same* coordinate space as the one from the first image.

The authors used supervised learning to train their model, which means they needed the corresponding 3d locations of pixels in image pairs of the same scene. Their training dataset consisted of about 8 million examples of this kind of data, and contained both indoor and outdoor images, as well as images of objects. Then, to optimize the model’s parameters, they used a regression loss, which is just the average error, or distance, of where DUSt3R thinks the pixels are versus where they actually were. These errors were each scaled by the confidence value that DUSt3R predicts, which is helpful because sometimes it’s really hard to know the exact location of particular pixels, like ones in the sky or in reflections.
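In code, a confidence-weighted regression loss of this flavor might look like the following. The log-confidence penalty, which stops the model from simply driving every confidence to zero, is the usual construction for such losses; the exact form and the `alpha` value here are my assumptions, not necessarily the paper’s:

```python
import math

# Confidence-weighted regression: per-pixel 3d error scaled by the predicted
# confidence, plus a penalty that discourages uniformly low confidence.
def conf_weighted_loss(preds, gts, confs, alpha=0.2):
    total = 0.0
    for p, g, c in zip(preds, gts, confs):
        err = math.dist(p, g)            # distance between predicted and true 3d point
        total += c * err - alpha * math.log(c)
    return total / len(preds)

preds = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # predicted 3d points (toy values)
gts   = [[0.0, 0.0, 1.5], [1.0, 1.0, 1.0]]  # "ground-truth" 3d points
confs = [0.5, 1.0]                          # low confidence on the first pixel
print(conf_weighted_loss(preds, gts, confs))
```

A hard pixel (like one in the sky or a reflection) can be given low confidence, shrinking its error term at the cost of a small log penalty.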

Compared to about a dozen other MVS methods (some neural network-based, others more traditional), DUSt3R performed the best in terms of absolute relative error. But the more impressive result is DUSt3R’s zero-shot prediction accuracy (its accuracy on datasets it wasn’t trained on), where it was almost as good or sometimes even *better* than non-zero-shot neural approaches. Note that the traditional approaches should also be considered zero-shot, since they weren’t designed for any specific dataset — but DUSt3R still seems to outperform these approaches more often than not. And remember: DUSt3R doesn’t need information about the cameras’ poses or their intrinsic parameters either!

The DUSt3R researchers recently followed up their method with an extension they call MASt3R. MASt3R improves on DUSt3R’s approach by emphasizing pixel matching: matching each pixel in one input image with the pixel of the same point (in the 3d scene) in the other input image. Here’s an example of pixel matching between two input images:

The figure below shows MASt3R’s architecture, which adds an additional head onto the ViT’s decoder. From the image patches, this head generates a vector of features for each pixel in each image; this new vector-per-pixel data provides the additional info MASt3R uses to match pixels across the input images. And MASt3R’s loss function is the same as DUSt3R’s confidence-weighted regression loss, but with an additional loss term that penalizes the model for every pixel that it incorrectly matches.

Without going too deep into the details of MASt3R, this extension (pixelwise matching) adds a lot of additional complexity (efficient pixel-matching is non-obvious), but the authors introduce clever algorithms for solving this and other related problems. All this effort is worth it, though, since MASt3R is both more accurate and more robust to viewpoint and illumination changes than DUSt3R. Also, aside from the main purpose of predicting point clouds, these models’ results can be used for things like camera calibration, inferring the cameras’ pose, depth estimation, and dense 3d reconstruction.

Being neural network-based, DUSt3R and MASt3R share afflictions similar to neural networks in other domains. Despite their impressive zero-shot performance, these approaches might need to be retrained to work effectively in contexts that differ substantially from their training data, such as in underwater or aerial imagery, or imagery with very wide or very long lenses. Traditional MVS approaches would be more robust to these sorts of changes, provided their models can be adjusted for these settings.

In either case, there’s no one-model-fits-all approach to MVS, much like how our brains are particularly well adapted to MVS from our two eyes, but would need to be “retrained” if our vision was suddenly inverted or if the shape and composition of our eyes suddenly changed in some way. This paper hasn’t brought us any closer to understanding how our brains do MVS, but it has shown that there’s more than one process that can achieve it. Maybe our brains work like DUSt3R, or maybe they have their own method that’s still a mystery.

LLMs like GPT-4 and Gemini act like they know everything. They do know a lot of stuff, but certainly not everything. LLMs also speak confidently, rarely saying things like “I don’t know the answer to your question” or “I’m not sure.” This leads to a widely known problem: LLMs sometimes hallucinate or confabulate information to fill in gaps in their knowledge. This is frustrating and it severely limits their ability to be used in real-world situations, since it’s really hard to fact-check everything an LLM says to make sure it isn’t just making stuff up. But a new paper published in *Nature* proposes a statistical way to detect hallucinations.

The key piece of information that makes confabulation-checking possible is that, when an LLM generates text, it also produces a probability for each token it generates, given the tokens that came before. Using these token probabilities, there’s actually a simple approach to detecting hallucinations. For example, we could use the following process to detect whether an LLM’s answer to the question “Where is the Eiffel Tower?” is a hallucination:

1. Ask the LLM to generate many different answers to this question.
2. Aggregate the probabilities of the tokens within each answer to get a probability for the whole answer.
3. Combine the probabilities over the answers into a single value, called *predictive entropy*, which is the conditional entropy of possible answers to the question.

When the predictive entropy is low, it means that the distribution of answers is heavily concentrated on a small number of answers. When it’s high, it means many answers are roughly equally likely.
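As a toy illustration of those three steps (the token log-probabilities below are made up, and real implementations often length-normalize sequence log-probabilities, a detail omitted here):

```python
def sequence_logprob(token_logprobs):
    # Step 2: aggregate the token probabilities within one answer
    # (log-probabilities add, i.e., probabilities multiply).
    return sum(token_logprobs)

def predictive_entropy(sampled_answers):
    # Step 3: a Monte Carlo estimate of the entropy over answers,
    # -E[log p(answer | question)], averaged over the sampled answers.
    logps = [sequence_logprob(a) for a in sampled_answers]
    return -sum(logps) / len(logps)

# Hypothetical token log-probs for three sampled answers to
# "Where is the Eiffel Tower?"
samples = [
    [-0.1, -0.2],        # e.g., "Paris"
    [-0.1, -0.3],        # e.g., "It's Paris"
    [-2.0, -1.5, -0.5],  # e.g., "Rome"
]
print(round(predictive_entropy(samples), 3))  # → 1.567
```

Here two of the three answers are high-probability, so the estimated entropy stays modest; if all three answers were equally unlikely, it would be higher.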

But there’s a big problem with this approach, as you can see in the figure below. There are many different ways an LLM can answer a question, and sometimes the aggregated probabilities of correct answers are lower than those of incorrect answers. In the figure, for example, “France’s capital Paris” has a lower probability than the hallucination “Rome.” This can result in a misleadingly high naive predictive entropy, even when the space of possible answers is in fact skewed toward a small number of answers that, in this example, are *not* hallucinations.

You can probably see the issue with naive entropy in this example: Even though the answer “France’s capital Paris” is correct, it’s just not the way one would typically answer the question. The authors’ solution to this problem is to calculate the entropy across the semantic categories of the LLM’s responses rather than the responses themselves. Semantic equivalence is a relation that holds when two sentences mean the same thing — this idea can be extended to group any number of outputs from the model into these categories. There are two things we need in order to do this: First, a way to find these semantic categories and know which sequence belongs to which category, and second, a new way to combine the sequence-probability values into a final semantic entropy.

To solve the first problem (grouping together similar answers), the authors use special LLMs that determine whether two sentences are semantically equivalent (i.e., mean the same thing). These LLMs can be specialized for this task, such as DeBERTa-Large-MNLI, or general-purpose LLMs like GPT-3.5 that can predict when one bit of text implies another, given suitable prompts.

Once this model has determined which answers belong to the same semantic group, their sequence probabilities need to be combined. Mathematically, this involves summing the probabilities of all the sequences in a given group. Then, the probabilities of all the semantic groups are combined into a value that the authors call *semantic entropy*, similar to how individual sequence probabilities are combined in naive entropy. The authors also have another way to calculate semantic entropy when they can’t access the sequence probabilities: They assume each sequence has equal probability, and thus each semantic cluster’s probability is proportional to how many sequences were generated in that cluster. They call this value *discrete semantic entropy*. This approach makes sense if you have many sample points, because you expect to see repeats of answers in proportion to how likely they are. (But, if all the probabilities are very small, then you only ever see each answer once, and the regular semantic entropy value is clearly more accurate.)
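Here’s a simplified sketch of both quantities. The `same_meaning` function stands in for the semantic-equivalence model described above, and the greedy clustering and the normalization over sampled answers are my own simplifications rather than the paper’s exact estimator:

```python
import math

def cluster(samples, same_meaning):
    """Greedily group sampled answers into semantic clusters.
    samples: list of (answer_text, sequence_logprob) pairs."""
    reps, clusters = [], []
    for text, logp in samples:
        for i, rep in enumerate(reps):
            if same_meaning(text, rep):
                clusters[i].append(logp)
                break
        else:  # no existing cluster matched: start a new one
            reps.append(text)
            clusters.append([logp])
    return clusters

def semantic_entropy(samples, same_meaning):
    # A cluster's probability is the sum of its members' sequence
    # probabilities; normalize over the samples, then take the entropy.
    masses = [sum(math.exp(lp) for lp in c)
              for c in cluster(samples, same_meaning)]
    total = sum(masses)
    return -sum(m / total * math.log(m / total) for m in masses)

def discrete_semantic_entropy(samples, same_meaning):
    # Variant for when sequence probabilities are unavailable: each
    # cluster's probability is proportional to its member count.
    sizes = [len(c) for c in cluster(samples, same_meaning)]
    n = sum(sizes)
    return -sum(s / n * math.log(s / n) for s in sizes)

# Toy equivalence check standing in for the NLI model.
same = lambda a, b: ("Paris" in a) == ("Paris" in b)
samples = [("Paris", math.log(0.4)),
           ("France's capital Paris", math.log(0.3)),
           ("Rome", math.log(0.3))]
print(semantic_entropy(samples, same))           # clusters: 0.7 vs. 0.3
print(discrete_semantic_entropy(samples, same))  # cluster sizes: 2 vs. 1
```

Note how the two Paris answers collapse into one cluster with combined probability 0.7, so the semantic entropy is low even though no single Paris phrasing dominates.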

To use the semantic entropy approach, an LLM generates many answers to a question, and the authors choose an answer from the highest-probability semantic cluster as the final answer. The figure below compares their two methods: semantic entropy (blue) against naive entropy (green) and other baselines (red and yellow) on QA and math tasks. The *AUROC* metric captures how well the model can distinguish between hallucinations and non-hallucinations, where a value of 0.5 indicates the result from random chance. The *AURAC* metric, which the authors invented, tracks (but isn’t identical to) the accuracy of answers given by a model which can say “I don’t know” when it notices itself hallucinating. In other words, the AUROC score measures how good a system is at *noticing* hallucinations, and the AURAC score measures how good the system is *after filtering out* hallucinations. In both cases, higher is better, and the two scores may be quite different because the AURAC score depends heavily on the accuracy of the underlying model, such as LLaMA 2 or Falcon.
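For intuition about the AUROC metric: it can be computed as a rank statistic, the probability that a randomly chosen hallucination receives a higher uncertainty score than a randomly chosen non-hallucination. A minimal sketch with made-up scores:

```python
def auroc(scores, labels):
    # Probability that a random positive (label 1, a hallucination)
    # scores higher than a random negative (label 0); ties count as half.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives 1.0; random scoring hovers around 0.5.
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```

So a detector with AUROC 0.8 ranks a hallucination above a correct answer 80% of the time, regardless of where you place the “refuse to answer” threshold.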

Overall, I think the semantic entropy approach is fantastic. It uses information and models that are already available (token probabilities and semantic equivalence models) and clever statistical tricks to detect confabulations, without having to train specialized models like the baselines do. In fact, this points to something a bit deeper: The red-colored baseline in the figure above is from a paper titled “Language Models (Mostly) Know What They Know.” The title indicates that LLMs actually have some awareness of when they’re making stuff up. The result from this paper adds more weight to that idea. To me, this suggests that hallucinations and confabulations are a consequence of the way we’re training LLMs. Maybe we just need to find a way to train LLMs to know when to say, “Sorry, I don’t know.”

Black holes are extreme objects that demonstrate some of the complex and bizarre ways that the universe works. For example, imagine you were observing an orbiting planet transit across the far side of a black hole. At the moment that you, the black hole, and the planet were in perfect alignment, the planet would look to you like a squished ring around the black hole. This is because of a strange phenomenon called *gravitational lensing*, which happens because the black hole’s gravitational field distorts spacetime; i.e., light no longer travels in a straight line! This distortion can be quite counterintuitive because the mental model of spacetime that we use on a daily basis (that light travels in straight lines) is a simplification of the more complicated reality that’s better described by Einstein’s general relativity.

As someone who thinks about machine learning frequently, I find that my brain also constructs similarly helpful-yet-simplified models of machine learning-related concepts, like transformers. For example, when I see the phrase “transformers are next-token predictors,” my brain often reads this as “transformers predict the next word in a sentence, kind of like how a person might anticipate a sentence’s continuation.” Unsurprisingly, mental models like these, while helpful, are imprecise. But what *is* perhaps a surprise is that, mathematically, this particular mental model is imprecise in a very similar way to how our mental model of spacetime doesn’t reflect the reality described by general relativity. 🤯

Researchers from Google DeepMind have investigated how information propagates in decoder-only transformers. They found that transformers definitely *don’t* continue sentences the way a person might, because of a subtle detail of how transformers work that my simplified mental model (the finish-a-sentence model) doesn’t capture.

Here’s a simple illustration of the researchers’ experiments. Imagine asking a transformer what it “thinks” after showing it a sequence of digits. The sequence contains only ones, except for the last digit, which is zero. Here’s what they found: As the number of ones increases, the transformer’s last-token representation (i.e., what it “thinks”) converges. In other words, the representations of the strings 10 and 110 are very different, but the representations of strings with 1,000 ones and 1,001 ones are basically the same. The figure below shows this phenomenon, where the color and proximity of the curved lines illustrate how these representations converge as the sequence length increases.
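You can reproduce the flavor of this convergence with a toy model (my own illustration, not the paper’s setup): a single layer of uniform attention, where the last token’s representation is simply the mean of the embeddings of all the tokens it attends to.

```python
import numpy as np

# Toy embeddings for the two digit tokens (an arbitrary choice).
emb = {"0": np.array([1.0, 0.0]), "1": np.array([0.0, 1.0])}

def last_repr(seq):
    # Uniform attention: the last token's representation is the mean of
    # every token embedding in the sequence.
    return np.mean([emb[c] for c in seq], axis=0)

# Short sequences differ a lot; long ones are nearly indistinguishable.
d_short = np.linalg.norm(last_repr("10") - last_repr("110"))
d_long = np.linalg.norm(last_repr("1" * 1000 + "0") - last_repr("1" * 1001 + "0"))
print(f"short: {d_short:.4f}   long: {d_long:.8f}")
```

The lone zero’s contribution shrinks like 1/n as the ones pile up, so the representations of the long sequences become nearly indistinguishable; that shrinking contribution is the essence of squashing.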

Why does this happen? At a high level, it’s because, as the sequence length increases, there are exponentially more paths for the information of earlier tokens in the sequence to propagate to the last token’s representation than there are for the last token’s own information to propagate there. This phenomenon is called *squashing* and, when hardly any of the information from the last token makes it to the final layer, it’s said to have been “over-squashed.” The figure below shows how this happens: The red token has fewer paths to travel to the final token than the blue token does, which demonstrates that earlier tokens tend to squash later ones.

This phenomenon can be formalized mathematically by analyzing how sensitive the output of the last token’s representation is to changes in each of its input tokens. When we do this, there are three possible cases:

1. The output might be equally sensitive to changes in early and late tokens,
2. The output might be more sensitive to early tokens and less sensitive to later tokens, or
3. The output might be less sensitive to early tokens and more sensitive to later tokens.

Based on the examples we’ve looked at, can you tell which of the above three cases the transformer exhibits? If you said the second case, you’re correct! That is, the output is *more* sensitive to changes in earlier tokens and *less* sensitive to changes in later tokens. (If you’re following along with the paper, this is stated in Theorem 5.1 and proved in Appendix Theorem B.5.)

The reason the transformer exhibits this behavior has to do with the number of paths between the input tokens and the output representation. This idea of studying paths between two points is related to a mathematical concept called *curvature*. You can think of the general concept as categorizing the local shape of a space as either spherical (nearby parallel lines converge), Euclidean (nearby parallel lines remain parallel), or hyperbolic (nearby parallel lines diverge). Intuitively, this concept gives you a sense of whether the space around a point is expanding or contracting. One specific kind of curvature called Gaussian curvature describes how curved a 2d manifold (a curved surface) is at a particular point — its value, shown in parentheses below, indicates the shape at the points shown by the black dots.

(Fun fact: Ricci curvature — another measure of the curvature of manifolds — also appears in Einstein’s field equations, where it corresponds to the curvature of spacetime due to energy and momentum! I’m not going to pretend like I understand these equations, but if you’d like to learn more you can read about the connection between neural networks and spacetime in this awesome blog post.)

When we talk about graphs — like the paths in the transformer — we’re talking about another specific kind of curvature called the *balanced Forman curvature* (as defined in this paper). This concept is another way to mathematically capture the behavior of information flow that biases the last token toward receiving more information from earlier tokens and less from later tokens. In fact, the over-squashing phenomenon was first observed in another kind of neural network architecture called graph neural networks, where over-squashing can occur between nodes in a graph that have a small number of paths between them, like in the tree depicted below.

Theory aside, what are the consequences of this analysis? A practical consequence is that a sequence can get so long that it’s literally *impossible* for later tokens’ information to propagate to the last token’s output, since the information has been over-squashed to the point that the precision of floating-point numbers can’t capture it anymore. It also means that — and this is my favorite takeaway from the paper — transformers are really bad at counting. As a transformer counts more numbers, it will eventually forget what the most recently counted numbers were, and they’ll essentially all blur together to the point that it doesn’t know what number should come next! 😂

While this is fascinating, the above discussion is a bit of a simplification of reality since it ignores attention weights. These weights *also* influence the curvature of information propagation and whether over-squashing will occur. In practice, transformers might learn to pay more attention to the tokens that appear later in the sequence.

The researchers also point to a simple way to reduce over-squashing: insert additional tokens, like a comma every three digits in the ones-and-zero example I described. This strategy helps to keep the representations more distant. More permanent solutions might involve modifying the transformer’s architecture, like how this paper adds more edges to the GNN where the Ricci curvature is the most negative. In practice, you probably don’t need to understand differential geometry (I certainly don’t!) in order to use transformers for something practical, but this paper is a helpful reminder that sometimes things aren’t as simple as our brains might pretend they are.

If you’ve ever used Google’s Gemini model, you might have noticed that sometimes it shows you the model’s “drafts,” which make it clear that Gemini isn’t a simple language-in, language-out LLM. It has a more complicated inference procedure that involves drafting before generating its final output. Processes like this drafting trick are a common way to improve an LLM’s accuracy and utility, but sometimes these approaches don’t improve the accuracy enough, or they make the model substantially less efficient. But a new approach called buffer of thoughts (BoT) is an accurate and efficient way to augment an LLM’s output that mimics the way people retrieve and refine their thoughts.

You might be familiar with some of the simpler augmentation processes, such as few-shot prompting or chain-of-thought (CoT) prompting. These approaches use a single query to prompt the LLM; sometimes this makes the LLM more accurate, but it’s not a silver bullet. More sophisticated processes use multiple queries to build a graph or tree of thoughts, iteratively refining and pruning the graph before reaching a final output. We covered the tree of thoughts (ToT) paper last year; give it a read if you haven’t already (link). The figure below shows these single- and multi-query processes, as well as the new BoT approach.

The BoT process has four main components:

1. The *problem distiller* extracts the essential parameters and variables from the problem, and outlines the objectives of the input task and any corresponding constraints.
2. The *meta buffer* is a buffer of problem-solving templates. There are six problem-specific templates for tasks — such as text comprehension, mathematical reasoning, or programming — and three general-purpose templates.
3. The *instantiated reasoning* process uses thought retrieval to select the template that best matches the input query, then uses both the template and the distilled information to generate a solution to the query.
4. The *thought distillation and update* process uses any new information gained from problems that have been solved using BoT to update the thought templates. This helps the BoT process apply successful problem-solving techniques to new problems.

To better understand how BoT actually works, here’s an example from the paper that we’ve summarized for solving the following mathematical problem:

On average, a shop sells 20 shirts every day for a profit of 40 yuan each. The shop wants to expand sales, increase profits, and reduce inventory. An investigation found that for every 1 yuan that their shirts decreased in price, on average they’d sell 2 more shirts per day. How much should the price of each shirt be reduced to average a 1,200 yuan daily profit?
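If you want to check the arithmetic yourself (this is my own worked solution, not BoT’s output): let x be the price reduction in yuan, so the daily profit is (40 − x)(20 + 2x), and setting that equal to 1,200 yields a quadratic.

```python
import math

# Profit per shirt: 40 - x; shirts sold per day: 20 + 2x.
# (40 - x)(20 + 2x) = 1200  =>  x**2 - 30x + 200 = 0
a, b, c = 1, -30, 200
disc = math.sqrt(b * b - 4 * a * c)
roots = sorted(((-b - disc) / (2 * a), (-b + disc) / (2 * a)))
print(roots)  # → [10.0, 20.0]
```

Both roots hit the 1,200-yuan target, but since the shop also wants to expand sales and reduce inventory, the intended answer is the larger reduction of 20 yuan.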

According to the results in the paper, the BoT process outperformed seven other problem solving techniques on 10 different tasks, such as multi-step arithmetic, sonnet writing, and checkmate-in-one (chess). BoT is also much more efficient than competing methods that use multiple queries, and is less affected by the inherent randomness of LLM outputs, achieving a higher problem-solving success rate than other methods.

The authors say that BoT often struggles on tasks that require human creativity since templates aren’t as helpful for these problems. They also found that it is quite sensitive to the quality of the underlying LLM. I also have some thoughts about the BoT process, so it’s ✨Speculation Time!✨ I suspect that BoT’s performance would be highly sensitive to the types of tasks it’s used to solve and the library of thought templates in its thought buffer. This isn’t a dig at the BoT method, since I think it’s reasonable to expect that the templates need to be curated and expanded to make the BoT a more general problem solver. I also suspect that human curation of these templates would be much more effective than automatic curation via the proposed thought-distillation approach, but that obviously isn’t very scalable.

Finally, I think we might be reaching the limit of how much we can improve an LLM’s problem-solving ability by augmenting its generation process with techniques such as CoT, ToT, and BoT. The BoT method is essentially a way to retrieve instructions for solving particular problems. But this isn’t really a general solution to using LLMs to solve problems, and ultimately, it isn’t a replacement for good LLMs and neural network innovation.

There’s a longstanding relationship between chess and AI. Chess played a key role in demonstrating machines’ ability to tackle complex reasoning tasks when Deep Blue, IBM’s chess engine, defeated world champion Garry Kasparov in 1997. Since then, there have been several state-of-the-art chess engines, such as AlphaZero and Stockfish, which are powered by search.

What exactly does “search” imply in the context of chess engines? If you play chess, you already know that experienced players can think several moves ahead: “If I play this, my opponent will most likely play this, which will then let me play that.” Well, that’s exactly what search allows these chess engines to do — analyze future moves based on a potential move.

Novice- (like myself) and amateur-level players, on the other hand, play chess “in the moment,” one move at a time. Interestingly, several chess champions are purported to have said that they only think one move ahead — hmm, curious. With the vast amount of data at our disposal these days and the availability of large models, imagine if we could train a chess agent that only thinks one move ahead. Would it be as effective as the current state-of-the-art chess engines? Does training on a large dataset using a model with tons of parameters make a search component unnecessary? This is precisely what this week’s paper investigates.

The researchers trained their chess engine differently than is typical. They trained three models, each of which predicts one type of score:

Action value: the expected chance of winning after making a given move

State value: the expected chance of winning from a given board position

Behavioral cloning: the move that the oracle engine would choose in a given position

They trained these models using standard supervised learning, which is notable because most chess engines are trained using reinforcement learning algorithms.

For brevity's sake, I'll focus on the action-value model as the researchers found it to be ~6% better than the state-value model and ~13% better than the behavioral-cloning model. To train the action-value model, the authors curated a large dataset of chess games in an unconventional manner. Rather than hand-crafting features, they generated data in a somewhat automated fashion using a chess engine. They downloaded ten million games from Lichess (an online chess platform), and then extracted chess states (information about which pieces are where, whose turn it is to play, etc.) from these games. Finally, they used Stockfish 16 to estimate action-values for every legal action at a given state. Using this strategy, they were able to amass 15.32B action-value estimates to use as the training set. They curated the test set in a similar manner but with fewer games and a different time frame.

They also did something interesting with the data representation they fed into the model. They represented board states in FEN (Forsyth-Edwards Notation) format as fixed-length ASCII strings of 77 characters, padded with periods if required. They then tokenized these FEN strings by designating each character as a unique token. They stored actions in UCI (Universal Chess Interface) notation (e.g., e2e4 represents the popular opening where a pawn is moved from square e2 to e4). The researchers tokenized actions by determining all possible legal moves, sorting them alphabetically, and then using the index of the action in question as its token. Afterwards, they concatenated the tokenized board states and actions to form the model inputs during training, a deviation from typical practice in other papers, where teams of researchers use some form of numerical representation as input.
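Here’s a rough sketch of that encoding (my own reconstruction from the description above; for simplicity it enumerates all from/to square pairs and ignores promotion moves, which the paper’s full action set would include):

```python
FEN_LEN = 77

def tokenize_fen(fen: str) -> list[str]:
    # Pad the FEN string to a fixed 77 characters with periods, then
    # treat every character as its own token.
    return list(fen.ljust(FEN_LEN, "."))

# All from-square/to-square pairs in UCI notation, sorted alphabetically.
SQUARES = [f + r for f in "abcdefgh" for r in "12345678"]
ALL_MOVES = sorted(a + b for a in SQUARES for b in SQUARES if a != b)
MOVE_TO_TOKEN = {m: i for i, m in enumerate(ALL_MOVES)}

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(len(tokenize_fen(start)))  # → 77
print(MOVE_TO_TOKEN["e2e4"])     # index of the classic king's-pawn opening
```

The fixed 77-character state plus the action token is what gives the transformer its context size of 79 mentioned below.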

The authors of this paper derived model targets by grouping action values into K discrete bins (K=128), which turns the task into a classification problem. During training, they provided board states as inputs and action values as targets to a decoder-only transformer model with a context size of 79 and an output size of K. After training, they obtained a chess model which, when shown a board state and all legal actions, computes the confidence score of each legal move one at a time, then picks the action that provides the best win rate by taking the argmax of the confidence scores. Their training setup basically distilled the knowledge of Stockfish 16 (termed an Oracle) into a transformer model.
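A sketch of how the targets and move selection might look (the helper names and the uniform bin boundaries are my own assumptions; the paper’s exact discretization may differ):

```python
K = 128  # number of discrete value bins

def value_to_bin(win_prob: float, k: int = K) -> int:
    # Map an oracle-estimated win probability in [0, 1] to one of K
    # class labels -- the classification target during training.
    return min(int(win_prob * k), k - 1)

def pick_move(legal_moves, predicted_bins):
    # At play time: score every legal move one at a time, then take the
    # argmax over the predicted value bins.
    return max(zip(predicted_bins, legal_moves))[1]

print(value_to_bin(0.0), value_to_bin(0.5), value_to_bin(1.0))  # → 0 64 127
print(pick_move(["e2e4", "d2d4", "g1f3"], [90, 101, 77]))       # → d2d4
```

Note that binning also explains the indecisiveness discussed below: once several winning moves saturate the top bin, the argmax has no way to prefer the fastest checkmate.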

The authors trained three action-value models of different parameter sizes (9M, 136M, and 270M). Overall, the 270M-parameter model performed best. While playing against bots on Lichess, it had an Elo rating of 2299, outperforming GPT-3.5-turbo-instruct (1755); while playing against humans, it had an Elo rating of 2895! Basically, it played at grandmaster level.

Via ablation experiments, the authors also noted that training with more data led to better model performance. Their findings imply that, by using a large model with the attention mechanism and a vast amount of data, you can model complex tasks in a supervised learning setup, something that previously required some combination of reinforcement learning and heuristics. This also shows that transformer models can do more than next-token prediction and can be used for other tasks, like classification. While it’s quite impressive to see this model play chess at grandmaster level without analyzing several moves ahead, keep in mind that it was trained for and performed at that level when playing *blitz* chess (a fast game that allows little reaction time). In blitz chess, there really isn’t time to think several moves ahead, so a chess engine devoid of a search strategy may be particularly well suited for this scenario.

The authors don’t claim that their model is perfect. They note that the predictors are often indecisive in the face of overwhelming victory (where more than one move significantly strengthens one’s hand for a checkmate) since many actions may end up with a maximum bin value. This causes the chess agent to play randomly rather than commit to a move that will lead to a checkmate in the fewest number of steps. This critique of the paper highlighted this issue, among others, as a potential red flag and argued that the agent can’t be said to play at a grandmaster level if its endgame is weak.

Personally, I feel the findings in this paper are quite impressive, though perhaps the researchers could have gotten better results by structuring the modeling task as a regression problem rather than a classification problem. But the fact that they attained grandmaster-level chess without using search is a big milestone, one that could potentially lead to chess models that are computationally simpler than the current state-of-the-art ones.

In philosophy of mind, *qualia* are instances of subjective, conscious experience. Some examples are our perception of the *pain* of a headache, the *taste* of wine, and the *redness* of a sunset. Due to the subjective nature of these experiences, we actually don’t know whether two people experience these qualia the same way — you may experience the redness of a sunset in a different way than I do, for example. Do neural networks experience qualia too? In other words, would two neural networks learn to experience the redness of an evening sky in similar ways? A group of researchers have been wondering this, too, and they have a new hypothesis: the Platonic Representation Hypothesis (PRH).

Minyoung Huh and coauthors think that neural networks do in fact experience the world similarly. The figure below states the hypothesis, and includes a diagram that conveys the idea behind it. In the figure, images (X) and text (Y) are projections of a common underlying reality (Z). The PRH conjectures that representation learning algorithms will converge on a shared representation of Z, and making models larger, as well as making data and tasks more diverse, seems to drive this convergence.

On the face of it, the PRH seems like a plausible claim, but what evidence do the researchers have for this hypothetical behavior? Here are some facts that the researchers cite:

**Different models, with different architectures and objectives, can have aligned representations**. For example, one study found that the layers from a model trained on the ImageNet dataset could be “stitched” together with the layers from a separate model trained on the Places-365 dataset (recognizing types of places from images), and the resulting stitched model still had good performance. This suggests the layers are data-independent and compatible with each other. Other studies show that this kind of compatibility also applies to other neural network components, like individual neurons; that is, you can find a neuron per model such that this neuron activates on seeing the same feature in the input image, no matter which model you use.

**Alignment increases with scale and performance**. The researchers cited a few papers that add weight to this claim, but they also conducted an experiment of their own: They evaluated how well 78 different vision models — trained with varying architectures, training objectives, and datasets — transfer to the Visual Task Adaptation Benchmark (VTAB, which is designed to test if a visual model can perform well on tasks it wasn’t specifically trained for). They found that models that transfer well to VTAB had very similar representations, while no such similarity was found among the models that couldn’t adapt to VTAB.

**Representations are converging across modalities**. This claim is also supported by several studies, but the researchers conducted their own experiments to determine whether models are indeed learning an increasingly modality-agnostic representation of the world. Using a dataset of paired images and captions, they found that the better a language model is at language modeling, the more its representations aligned with DINOv2, a vision model. These results are shown in the figure below.

**Models are increasingly aligning to brains**. They cited some studies to support this claim, though it’s by far the weakest of their five claims.

**Alignment predicts performance on downstream tasks**. The researchers found that there’s a correlation between how well language models align to DINOv2 and how well they perform on downstream tasks, like commonsense reasoning and mathematical problem solving.

Next, the researchers investigated why such representational alignment might be occurring. They present three further ideas:

**The Multitask Scaling Hypothesis**. Each training datapoint and objective (task) places an additional constraint on the model. As data and tasks scale, the volume of representations that satisfy these constraints must grow proportionately smaller.

**The Capacity Hypothesis**. Larger models and better learning objectives should be better at arriving at optimal solutions to problems. This idea sounds interesting, but it doesn’t really explain *why* the representations of optimal models should be similar.

**The Simplicity Bias Hypothesis**. All neural networks, even unnecessarily large ones and ones that lack regularization, tend to arrive at the simplest representations. (We actually touched on this topic last week; have a read if you missed it.)

The paper goes on to discuss what kinds of representations are being converged to, and some implications of that convergence. But, before we finish up, I think it’s important to mention some counterexamples and limitations of their research:

Representations of modality-specific concepts, such as visually experiencing the beauty of a total solar eclipse, can’t be learned solely from other modalities.

Not all representations are presently converging.

The demographic bias of people creating AI models might also accidentally bias them toward similar representations.

The level of measured alignment might actually be quite small. For example, the maximum measured alignment in the DINOv2 figure above is 0.16 on a scale of 0 to 1. The researchers aren’t sure whether this is indicative of peak alignment or not.

After reading this paper, I’m not convinced that neural networks’ representations are converging. But I do think that the idea is interesting and plausible, and this paper introduces lots of different avenues for further exploration. For example, I’d love to see more investigation into methods of measuring alignment and a more comprehensive analysis of what kinds of representations align, how much alignment varies across different kinds of representations, and how alignment changes as models change.

A well-known rule in statistical machine learning is that a statistical model shouldn’t have more parameters than the number of samples that were used to train it. That’s because the model will have enough parameters to fit each of the samples exactly, and so it will be less likely to generalize to unseen data. But this rule is seemingly contradicted by modern deep neural networks like Llama 3 and Stable Diffusion — models that have hundreds of billions or even *trillions* of parameters. Why can models like these generalize well to unseen data even when their training data size is smaller than their parameter counts? This week’s Learn & Burn will cover this strange phenomenon, known as *double descent*, rather than our typical focus on a single research paper.

Double descent is a phenomenon where a model can continue to generalize well to unseen data even when it has many more parameters than training data samples. The figure below demonstrates this using a polynomial fitting example. The main thing being compared is the parameter count, or degree, of the polynomial, and whether it is less than, approximately equal to, or greater than the number of training samples.

Let’s look at these cases one by one:

- On the left, the degree-1 polynomial doesn’t fit the data well because it doesn’t have enough parameters to fit the nonlinear training data.

- In the middle, it *looks* like the fitting problem is solved, since the degree-10 polynomial fits the 11 data points precisely. But a precise fit isn’t ideal, since any unseen data drawn from the same distribution as the training data probably won’t fall exactly on this polynomial’s curve. This problem is called *overfitting*.

- Finally, on the right, the degree-30 polynomial seems to fit the data quite well, despite having substantially more parameters than the training data that it’s being fit to.

We need to include some regularization in the final curve’s optimization; otherwise, it would look “bumpier” as it contorts to pass exactly through every point. As we’ll see below, researchers suspect that this kind of regularization (intuitively, a preference for learning simpler patterns) is central to the double descent phenomenon.

(Side note: About a year ago, Yann LeCun, the Chief AI Scientist at Meta, gave a brief explanation of the double descent phenomenon in this fireside chat. He said that the double descent phenomenon can be observed with polynomial fitting, too. A curious viewer then asked Claude, an LLM like ChatGPT, to write some code that demonstrates double descent with polynomial fitting. The figure you saw above is the one that Claude’s code generated!)
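If you want to poke at this yourself, here’s a minimal numpy sketch in the same spirit (this is my own illustration, not Claude’s actual code): it fits degree-1, degree-10, and degree-30 polynomials to 11 noisy points using the minimum-norm least-squares solution, which acts as a mild implicit regularizer in the overparameterized case.

```python
import numpy as np

rng = np.random.default_rng(0)

# 11 noisy training points from a smooth underlying function.
x_train = np.linspace(-1, 1, 11)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(11)

def fit_poly(x, y, degree):
    # Minimum-norm least squares: for degree >= len(x) - 1 there are many
    # interpolating polynomials, and pinv picks the one with the smallest
    # coefficient norm -- a mild implicit regularizer.
    X = np.vander(x, degree + 1, increasing=True)  # features 1, x, x^2, ...
    return np.linalg.pinv(X) @ y

def predict(coeffs, x):
    return np.vander(x, len(coeffs), increasing=True) @ coeffs

for degree in (1, 10, 30):
    coeffs = fit_poly(x_train, y_train, degree)
    train_err = np.mean((predict(coeffs, x_train) - y_train) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.2e}")
```

The degree-1 fit underfits (large training error), while the degree-10 and degree-30 fits both pass through every training point; how well the degree-30 fit behaves *between* the points depends on that implicit regularization.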

Here’s another figure (from Wikipedia) that I really like that helps explain double descent. Like the sequence of plots above, the x-axis spans, from left to right, regimes where the number of parameters is less than, equal to, and greater than the number of data points — this time for a two-layer neural network. But now the y-axis shows the training and test errors. We can see where the double descent phenomenon gets its name: As the number of parameters increases, the test error descends, then rises as it approaches the interpolation threshold, and then descends a second time beyond it.

Why do neural networks behave this way? We still don’t know the precise answer, but researchers have established that the data’s signal-to-noise ratio (SNR) and the amount of regularization used during training are central to the phenomenon. The figure below shows how these characteristics influence double descent. The panels show results on datasets with high (left) and low (right) SNR, and the colors show results of models trained with different levels of regularization (low regularization = blue, high regularization = yellow). We can see that, without regularization, the double descent phenomenon occurs regardless of the SNR. But when regularization *is* used, its optimal value — indicated by a test error that doesn’t increase at the interpolation threshold N/n = 1 — is slightly different in the high- and low-SNR cases.

*Image source: Fig 3 of https://arxiv.org/pdf/1908.05355*

Double descent and related strange phenomena that arise when we train neural networks — like grokking — still aren’t entirely understood. Researchers are still trying to develop solid theoretical explanations for why they happen. So far, we understand small pieces of the double-descent puzzle, such as:

- Poor generalization is most likely at the interpolation threshold.

- Models with optimal test error typically lie beyond the interpolation threshold.

- The behavior of double descent depends on the SNR in the data and the amount of regularization (as in the figure above).

If we’re lucky, the next time I talk about double descent on Learn & Burn will be when someone properly cracks the double-descent problem wide open!

In 2021, Google DeepMind announced AlphaFold 2, their latest deep learning–based protein-folding algorithm. Protein folding is the task of predicting the 3d coordinates of the heavy atoms in a given protein from some basic information about that protein, such as its primary amino acid sequence. Protein folding methods that preceded AlphaFold 2 — including the original AlphaFold — were decent, but AlphaFold 2 blew them out of the water, achieving error rates 3x smaller than the next best method. Now AlphaFold 3, which Google DeepMind has developed in collaboration with Isomorphic Labs, goes beyond proteins to analyze a broad spectrum of biomolecules.

Given an input list of molecules, AlphaFold 3 can determine their joint 3d structure to reveal how they all fit together. In addition to proteins, AlphaFold 3 can model other large molecules like DNA and RNA, as well as smaller ones known as ligands. For example, the figure below shows the structure of a protein (blue) bound to a double helix of DNA (pink), and how this structure compares to the ground-truth, experimentally measured structure (gray).

Architecturally, AlphaFold 3 is very similar to AlphaFold 2. Each method has two main components: one for generating representations of the molecules, and another for predicting their structure. AlphaFold 3’s representation method, called Pairformer, is a simpler version of the one in AlphaFold 2. Both of these methods work like a transformer: The attention mechanism operates on the chemical structure of the biomolecule, and a gating mechanism generates representations for pairs of atoms in the molecule, similar to the causal attention mask in a transformer. The figure below shows an example of the pair representation, and how the elements of the representation correspond to atoms in a graph representation of a molecule.

The next step AlphaFold 3 performs is determining the actual 3d coordinates of each atom in the joint biomolecular structure. Unlike AlphaFold 2, which used a complicated structure-prediction module that needed carefully tuned parameters to ensure that its predictions were plausible, AlphaFold 3 uses a diffusion model to directly predict the 3d coordinates. This works kind of like a text-conditioned, image-generating diffusion model, except that AlphaFold 3 uses the pairwise representations to condition the denoising of the atoms’ coordinates.

While a diffusion model offers many benefits — like not needing to enforce global rotational and translational invariances during generation — it also has drawbacks. For example, the researchers found that the model would hallucinate plausible chemical structures where they shouldn’t exist. To counteract this, they used predicted structures from AlphaFold-Multimer v2.3 — an extension of AlphaFold 2 for protein complex structure prediction — to enrich AlphaFold 3’s training data. This effectively taught AlphaFold 3 to mimic the non-hallucination behavior of its predecessor.

This new AlphaFold model is a tremendous leap forward in terms of its predictive accuracy, but it’s also a one-stop shop for many biomolecular modeling tasks. For example, AlphaFold 3 achieves much higher accuracy on protein-nucleic acid interactions than nucleic acid–specific predictors. It’s a similar story for protein-ligand interactions and antibody-antigen prediction. The method also demonstrates that deep learning methods are highly effective at modeling a variety of biomolecular interactions and will help us better understand how the most complex processes in our bodies work, like drug interactions, hormone production, and the health-preserving process of DNA repair.

What is a neural network? An archetype is a multilayer perceptron (MLP). MLPs are probably the most widely used architectural component of the NN variants we’re familiar with, like transformers and CNNs. But have you ever considered why this is the case? What if there was something else — maybe even something better — that we could use instead of MLPs? Today’s summary explores a new kind of NN building block called Kolmogorov-Arnold Networks, or KANs. As we’ll see, KANs are a fascinating new way to think about and construct NNs, and they may also offer some insight into why MLPs are so ubiquitous in today’s NN models.

According to Liu et al., KANs were inspired by the Kolmogorov-Arnold Representation Theorem, a mathematical theorem that says it’s possible to write a complicated, multivariate function using a bunch of simpler, univariate ones. If this sounds familiar, that’s because it’s very similar to the universal approximation theorem that underpins MLPs! MLPs work by combining input signals with learnable weights and a fixed activation function, and KANs work by combining inputs with learnable activation functions. The network representations in the figure below show this distinction. We can see that the MLP has different weights (indicated by the red and blue edges) and a fixed activation function (in this case, SiLU), whereas the KAN has much more complex edges that are summed together.

In a KAN, the activation functions are special piecewise functions called B-splines. If you’ve ever used the Bezier curve tool in Adobe Photoshop or a similar tool, then you might already have an intuitive idea about how B-splines can model an arbitrary function. The figure below shows a curve that is a weighted sum of several B-splines. The shape of the splines (and thus the overall curve) is controlled by the position of the anchor points. During training, the positions of these points are optimized so that the resulting splines act as useful activation functions.
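To make the spline idea concrete, here’s a small numpy sketch (my own illustration, not code from the KAN paper) that evaluates a curve as a weighted sum of B-spline basis functions via the Cox-de Boor recursion, and demonstrates the locality property: nudging one coefficient only changes the curve where that coefficient’s basis function is nonzero.

```python
import numpy as np

def bspline_basis(i, k, t, knots):
    # Cox-de Boor recursion: value of the i-th degree-k B-spline basis at t.
    if k == 0:
        return np.where((knots[i] <= t) & (t < knots[i + 1]), 1.0, 0.0)
    result = 0.0
    denom = knots[i + k] - knots[i]
    if denom > 0:
        result = (t - knots[i]) / denom * bspline_basis(i, k - 1, t, knots)
    denom = knots[i + k + 1] - knots[i + 1]
    if denom > 0:
        result = result + (knots[i + k + 1] - t) / denom \
            * bspline_basis(i + 1, k - 1, t, knots)
    return result

def spline_activation(coeffs, t, degree=2):
    # A "learnable activation": a weighted sum of B-spline basis functions,
    # where the coefficients play the role of the trainable anchor points.
    knots = np.linspace(0, 1, len(coeffs) + degree + 1)  # uniform knot vector
    return sum(c * bspline_basis(i, degree, t, knots)
               for i, c in enumerate(coeffs))

coeffs = np.array([0.0, 1.0, 0.5, -0.5, 1.0, 0.0])
t = np.linspace(0.25, 0.75, 200)
base = spline_activation(coeffs, t)

# Nudge one coefficient: the curve changes only where that basis is nonzero.
bumped_coeffs = coeffs.copy()
bumped_coeffs[2] += 0.5
changed = np.abs(spline_activation(bumped_coeffs, t) - base) > 1e-12
```

This locality is exactly the property discussed below: editing one anchor point leaves the rest of the curve untouched, unlike editing a weight in an MLP.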

Using splines as the basis of the “neurons” in a KAN yields an NN with useful properties. For starters, KANs are more interpretable than MLPs, since visual representations of their learned activations are much easier to understand than the raw weights of an MLP layer. Additionally, parameters can be added to a partially trained KAN to improve its accuracy (after some further training to fit the new parameters). Unlike MLPs — where added parameters can affect the whole NN — each additional KAN parameter only affects two splines (if the splines are quadratic), leaving the others unaffected. To use another Photoshop analogy, this is like drawing a rough outline of an object with a small number of Bezier curves, and then adding more points later to refine it.

Another benefit is that sometimes a simpler KAN can be better than a more complex one. For example, the researchers used two different KANs to fit a two-parameter function. To describe these networks, they used the notation [*n_0, n_1, …, n_L*], where *n_0* is the number of inputs or parameters in the data, *n_i* is the number of nodes or activation functions in layer *i*, and *L* is the total number of layers. For a certain two-parameter function, the authors tried a [2, 5, 1] KAN and a [2, 1, 1] KAN. The figure below shows the expression for the function *f(x,y)* that the KAN is fitting, and the error levels (RMSE) over training time. You can see that the [2, 1, 1] KAN achieves a lower test/training loss overall at the marked interpolation threshold.

The reason for the stepped nature of the loss in these training results is that, during training, the researchers progressively added more splines to each of the KAN’s activation functions, improving their accuracy. This is indicated by the “grid” information printed on the plots. It makes intuitive sense that the [2, 1, 1] KAN should perform better since it only takes two activations to fit the data in this case: one that sums a spline each for sin(πx) and y^2, and another for exp. The figure below shows what the activation functions would look like in this [2, 1, 1] KAN.

Unfortunately, this paper only demonstrates the ability of KANs on toy datasets like the one above. The authors didn’t run tests on even a simple real-world dataset, like MNIST. One reason might be that training a KAN is about 10x slower than training an MLP with the same number of parameters — though this may just be due to an inefficient training implementation that could improve as more researchers iterate on the training algorithm.

Also, while KANs seem to offer several advantages over MLPs, they are limited in several ways. For example, a KAN’s activations are only defined on a bounded range of input values, since splines don’t extend infinitely across the input’s domain. And, while KANs may be interpretable in toy examples, I’m not convinced that being able to observe the activations in a more complex KAN — say, one with more than a dozen layers — would really be that helpful. And of course there’s the argument that MLPs are already universal approximators, so an MLP can also represent a KAN. So I won’t hold my breath waiting for KANs to revolutionize NNs. But they’re cool nonetheless, and they may have some niche mathematical applications, such as symbolic regression or knot theory. If that sounds interesting to you, then I highly recommend you give this paper a read!

Just a few weeks ago, we covered some papers from Hao Liu — a PhD student at UC Berkeley — that describe how to modify an LLM’s architecture so that it can process million-token sequences. In that article, I speculated that these techniques might be how Gemini 1.5 Pro — a new LLM from Google — achieved its long-context processing abilities. Well, now Google has released a research paper of their own describing how an LLM can process *infinite*-length contexts using an interesting idea called compressive memory. The authors call their technique *Infini-attention*.

The figure below shows how the “attention” part of Infini-attention works. It begins by chunking the input sequence into segments, and then computing attention on a segment-by-segment basis. The green blocks represent a “memory” that’s updated with information from each segment that is then accessible by all subsequent segments.

As you can see in the next figure, Infini-attention is actually a combination of two types of attention: vanilla attention (purple) and compressive memory plus linear attention (green). Due to the segmentation, the purple vanilla-attention blocks compute the interactions between the queries, keys, and values of within-segment tokens only (Q_s and {KV}_s). The compressive memory, on the other hand, remembers the keys and values from previous segments ({KV}_{s−1}). The memory is built up iteratively, starting with the first segment. So, by querying this memory with the current queries Q_s, the output projection from each segment can still be aware of the keys and values of all tokens in all preceding segments.

These two kinds of attention are similar but differ slightly. Here’s how the linear attention (green) and vanilla attention (purple) are computed:

There are two things to note: First, each attention uses a different non-linearity: ELU+1 (nicknamed σ) for linear attention, and softmax for vanilla attention. Second, the vanilla attention requires quadratic space to compute the softmax, whereas the linear attention needs only linear space. I’ll discuss the implications of these differences in a moment.
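Here’s a toy numpy sketch of the two computations (my own simplification, ignoring batching, attention heads, and the learned gating that combines the two outputs):

```python
import numpy as np

def elu_plus_one(x):
    # The paper's sigma: ELU(x) + 1, which keeps all activations positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def vanilla_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n): quadratic in sequence length
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # softmax over keys
    return w @ V

def linear_attention(Q, K, V):
    sq, sk = elu_plus_one(Q), elu_plus_one(K)
    M = sk.T @ V                 # (d, d_v) summary -- size independent of n
    z = sk.sum(axis=0)           # (d,) normalization term
    return (sq @ M) / (sq @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
out_vanilla = vanilla_attention(Q, K, V)   # needs the full n x n score matrix
out_linear = linear_attention(Q, K, V)     # never materializes an n x n matrix
```

Note how the linear version collapses all keys and values into the fixed-size summary `M` before the queries ever touch them, which is exactly what makes the compressive memory below possible.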

The compressive memory is a special kind of memory called an *associative memory*. At its most basic level, this memory is just a matrix of values that changes every time it’s updated with new information — which, in the case of Infini-attention, are the keys and values from segments. The memory’s contents for the ith segment can be written as follows:

The subscripts on the K’s and V’s indicate which segment they’re from. The cool thing about this memory is that we can retrieve any given value *v* as long as we have the corresponding key *k*, where *v* and *k* are row vectors that have been previously stored in the memory. If we pretend that all of the K’s and V’s in the above equation have only a single row, and if we ignore the non-linearities for a moment, then we can retrieve V_2 from the memory using K_2 like this:

The way I see it, structuring the memory this way has three main benefits: First, the size of the memory stays the same no matter how many segment’s worth of keys and values are stored in it — it’s a d_k × d_v matrix. Second, the form of M is ready to be queried as-is: ignoring non-linearities, QM is equivalent to the definition of A_{linear}.

The third cool thing about this memory is that storing keys/values can be made slightly more efficient by modifying the storage process. The regular way to store new keys and values is to simply add them to the previous memory state M_{i-1}, like this:

But with Infini-attention, we can *retrieve* the existing value from the memory and update the memory with only the difference between the new value and the one that was already stored, like this:

The authors call this a “linear + delta” memory update, and it helps stop the memory from getting cluttered with too many values.
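Here’s a hedged numpy sketch of such an associative memory (the class and its names are mine, and I’m ignoring per-head details): it supports the plain linear update, the “linear + delta” variant, and retrieval.

```python
import numpy as np

def sigma(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # ELU + 1

class CompressiveMemory:
    # Fixed-size associative memory: a d_k x d_v matrix M plus a normalizer z.
    def __init__(self, d_k, d_v):
        self.M = np.zeros((d_k, d_v))
        self.z = np.zeros(d_k)

    def retrieve(self, Q):
        sq = sigma(Q)
        return (sq @ self.M) / (sq @ self.z + 1e-8)[:, None]

    def update_linear(self, K, V):
        # Plain update: just add the new key/value contributions.
        sk = sigma(K)
        self.M += sk.T @ V
        self.z += sk.sum(axis=0)

    def update_delta(self, K, V):
        # "Linear + delta": store only what the memory doesn't already hold.
        sk = sigma(K)
        retrieved = (sk @ self.M) / (sk @ self.z + 1e-8)[:, None]
        self.M += sk.T @ (V - retrieved)
        self.z += sk.sum(axis=0)

rng = np.random.default_rng(1)
mem = CompressiveMemory(d_k=4, d_v=3)
K = rng.standard_normal((1, 4))
V = rng.standard_normal((1, 3))
mem.update_linear(K, V)
recovered = mem.retrieve(K)   # querying with the stored key returns ~V
```

With the delta update, re-storing a key/value pair the memory already knows leaves `M` essentially untouched, which is the anti-clutter property the authors describe.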

The concept of combining compressive memory with linear attention is definitely neat. But it left me thinking: If the compressive memory with linear attention can cover the entire context length, then what’s the point of continuing to use the regular attention?

The reason they did so is that these two types of “attention” aren’t really equivalent. The linear attention idea isn’t new — it’s commonly used in compute-constrained settings or when very long contexts are necessary — but it’s not as effective as vanilla attention. That’s why vanilla attention hasn’t gone away; for example, the best open-source LLMs, like Llama 3, use vanilla attention. So, I guess Infini-attention is trying to have the best of both worlds: It uses vanilla attention for short snippets of a long context, and linear attention to cover the rest of the context. Ultimately, though, I suspect Infini-attention probably isn’t going to work as well as full-context vanilla attention.

In the rest of the paper, the Google researchers present some experimental results that show that Infini-attention works really well at the “passkey task” — which is essentially a needle-in-a-haystack challenge — after the model has been fine-tuned a little bit on that task. They also show that the Infini-attention method works better than other long-context techniques, and that the “linear + delta” approach works slightly better than the linear-only approach. I was hoping that this paper was going to reveal a silver bullet for long-context attention, but in reality it seems like another band-aid approach to me. But it’s great to see that people are coming up with new and — in the case of the associative-memory approach — very clever techniques.

In general, the more parameters an LLM has, the better it performs. The best-performing open-weight LLMs have hundreds of billions of parameters, but, oddly, not all of these parameters are useful. For a long time, AI researchers have known that some of a model’s parameters are much more important than others. Researchers even have a technique called “pruning,” which they use to remove some of these unhelpful parameters and reduce the size of the model without affecting its predictive performance very much. But there often wasn’t much rhyme or reason as to which parameters were useless — that is, until a recent research paper from Gromov et al., which found that *entire layers* of parameters in an LLM’s network can be pruned!

I know what you’re thinking: How can an entire layer be removed from an LLM — surely that’ll have a huge impact on its accuracy, right? Actually, it’s not quite that straightforward. The researchers found that certain layers can be pruned with minimal impact on the LLM’s predictive performance. In fact, layers can continue to be pruned, up to a point, before the LLM’s performance falls off a cliff.

Also, the predictive performance that’s lost by pruning layers can be restored with a tiny amount of fine-tuning of the pruned LLM. The figure below shows the predictive performance (y-axes) of the Llama-2-70B model against the fraction of the model’s layers that have been dropped. The top two plots show the model’s accuracy on two question-answering benchmarks, while the bottom plot shows the validation loss. The dark-blue trace shows the pruned model’s performance, while the light-blue trace shows the performance with “healing,” which is the post-pruning fine-tuning.

Before continuing, it’s worth taking a moment to consider why a pruned LLM behaves this way. I often think of the parameters in an LLM working together in perfect harmony. Disrupting this harmony — say, by deleting an entire layer! — would be very damaging for the network since any errors introduced would cascade down subsequent layers. This intuition is a decent model, so long as a layer’s output is significantly different from its input.

But, what if a layer’s output *wasn’t* significantly different? In that case, removing the layer shouldn’t affect the network very much, since it didn’t do much to begin with. In fact, with typical transformer architectures, we *do* expect layers to have outputs similar to their inputs, because each layer *adds* a delta to its input — that is, the output of a transformer layer is always an adjustment to (something added to) its input. And, since each layer is adjusting its input, we might also expect earlier layers to have more impact, since their changes have a compounding effect on the later layers.
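A toy numpy example makes this concrete (purely illustrative, not a real transformer): a “network” of three residual layers where the middle layer adds only a tiny delta, so deleting it barely moves the final output.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def layer(x, W, scale):
    # A toy residual layer: output = input + a (scaled) learned adjustment.
    return x + scale * np.tanh(x @ W)

W1, W2, W3 = rng.standard_normal((3, d, d)) / np.sqrt(d)
x = rng.standard_normal(d)

# Full "network": three residual layers; the middle one adds only a tiny delta.
full = layer(layer(layer(x, W1, 1.0), W2, 0.01), W3, 1.0)
# Pruned network: the near-inert middle layer removed entirely.
pruned = layer(layer(x, W1, 1.0), W3, 1.0)

rel_change = np.linalg.norm(full - pruned) / np.linalg.norm(full)
print(f"relative output change after pruning: {rel_change:.4f}")
```

Removing the near-inert layer changes the output by only a few percent, whereas removing a layer that contributes a full-strength delta would change it drastically.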

So, suppose that layers can be removed from an LLM without hurting its performance. This should indicate that the original LLM contains some layers that aren’t very useful (i.e., their output is typically quite similar to their input). That’s what these researchers found! The figure below shows how much a given layer’s output changes from its input. They measured it using a metric called Shifted Rescaled Angular Distance, which is close to 1 when the change is large and close to 0 when the change is small. (In case you’re curious, this distance is a rescaled version of the angle between the representation vectors before and after the layers being measured.) The y-axes in these plots indicate the number of consecutive layers that were pruned — so the bottom row represents the full architecture, while higher rows represent heavily pruned versions of the model. I’ll describe in a moment which layers were removed.

Across all model sizes (and other non-Llama LLMs that aren’t pictured), there seems to be a trend that the deeper layers in a network tend to contribute less than shallower layers. This means that many layers can be pruned without harming the LLM much. So, based on these results, the researchers devised the following strategy to prune layers from an LLM:

1. Choose how many layers you want to prune, *n*.

2. Compute the similarity (angular distance) between the inputs of all pairs of layers that are exactly *n* layers apart.

3. Select the pair with the highest similarity (lowest angular distance) and prune the *n* layers between them.

4. Optionally, heal the network with some fine-tuning.
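Here’s a small numpy sketch of steps 1–3 (my own illustration on synthetic “layer inputs,” using the angular-distance idea described above):

```python
import numpy as np

def angular_distance(x, y):
    # Rescaled angle between two representation vectors, in [0, 1]:
    # ~0 when the vectors point the same way, ~1 when they are opposite.
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def choose_block_to_prune(layer_inputs, n):
    # Find the start index l minimizing the distance between the input of
    # layer l and the input of layer l + n (i.e., the block's overall effect).
    dists = [angular_distance(layer_inputs[l], layer_inputs[l + n])
             for l in range(len(layer_inputs) - n)]
    best = int(np.argmin(dists))
    return best, dists[best]

# Synthetic "layer inputs": most layers shift the representation a lot,
# but layers 3 and 4 barely change it.
rng = np.random.default_rng(0)
x = rng.standard_normal(32)
layer_inputs = [x]
for step in (1.0, 1.0, 1.0, 0.01, 0.01, 1.0):
    x = x + step * rng.standard_normal(32)
    layer_inputs.append(x)

start, dist = choose_block_to_prune(layer_inputs, n=2)
print(start, round(dist, 4))   # picks the block of near-inert layers
```

The selection correctly lands on the block whose layers barely change the representation, which is exactly the block the strategy says to prune.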

This strategy is quite simple, but it requires a lot of work before pruning to determine the angular distances between layer inputs; it also requires the user to load and run the entire unpruned model. This might be prohibitive for some users, since they might be pruning precisely because they don’t have the resources to run an unpruned LLM. So, the researchers devised an even simpler pruning strategy: Decide how many layers you want to prune — call this number *n* — and remove the last *n* layers *before* the final layer (which the researchers noticed is always a useful layer), and then heal with fine-tuning. In the figure above you can see that removing the final layer is never a good idea; it has much lower similarity to the layers that precede it (it’s blue while the preceding layers are yellow).

The figure below compares the quality of these approaches. Each graph plots both the simple pruning strategy (in red) and the more complex similarity-based strategy (in blue). While the similarity-based approach tends to preserve more accuracy than the simple approach, the difference mostly vanishes when healing is used, as shown in the right column. So, if you plan to heal, then either approach is suitable.

By applying this pruning approach along with the latest model-quantization techniques, Llama-2-70B — which normally spans 140 GB of memory and consumes 30 billion floating-point operations (FLOPs) per token — can run with significantly fewer resources: 17.5 GB of memory and 15 billion FLOPs per token. This makes it possible to run the model on consumer computers, not just big, beefy datacenter machines. But it also leaves me wondering why the models we train end up with layers that contribute so little to the result. Pruning is great, but wouldn’t it be better if we could train equivalent LLMs that didn’t have unnecessary layers in the first place?

Have you ever right-clicked a webpage and selected “View Page Source”? If so, then you’ve glimpsed the world of frontend web development — the source code that tells your browser how the things on the webpage should look. Unless you’re a frontend developer, you’d probably have a hard time looking at a page design and mapping it to its source code. But could an AI do that? Today’s paper explores whether multimodal AIs like GPT-4 or Gemini Pro Vision can generate a webpage’s source code from an image of a page design. If it *is* possible, then this could become a component of webpage-building AI tools.

Design2Code is a framework based on the above premise — it auto-generates the code for a webpage based on an image of what that page should look like. The framework includes a dataset of webpage screenshots and their corresponding source code. The dataset contains a diverse range of webpages, including blogs, company/organization webpages, product pages, and news pages. Unlike other datasets of this kind that are typically generated synthetically, Design2Code’s dataset is sourced from real-world webpages. Here are a few examples from it:

The Design2Code dataset only contains 484 examples because it’s not meant for training models but for evaluating how well a model can generate webpages. The Design2Code framework can score a webpage-generating AI along 5 axes. To help generate these scores, the framework divides both the input image and the output webpage into *blocks* — rectangular sections of the image and the corresponding regions of the generated webpage. The blocks come in pairs (one from the image, one from the generated webpage), and it’s good if the blocks in a pair are similar to each other. Within that context, the authors measured these forms of similarity between the image and the generated page:

- Color: The perceptual difference between colors in the reference image and the generated webpage.

- Position: How closely the coordinates of blocks on the reference image and the generated page match.

- Block-match: Overall, how closely the set of blocks in the generated page matches the set of blocks in the image. (This criterion can help punish dropped or hallucinated blocks.)

- Text: How similar the text is between matched blocks from each webpage.

- CLIP: How similar the generated webpage is to its reference image overall, computed using image embeddings (a semantic vectorization) of screenshots of both.

The first four of these metrics are computed on a block-by-block basis. To do this, the authors needed a way to determine which blocks in the reference image and the generated webpage correspond to each other, even when the two blocks aren’t exactly the same. For example, if there is an “About us” block in the reference page, the algorithm might match it with the “About” block in the generated page. For this matching, the authors use a fancy but standard algorithm (the Jonker-Volgenant algorithm).
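Here’s a tiny illustration of that matching step (my own toy example; I use brute force for clarity, whereas in practice you’d use a linear-assignment solver such as `scipy.optimize.linear_sum_assignment`, which implements a modified Jonker-Volgenant algorithm):

```python
from itertools import permutations

import numpy as np

def match_blocks(cost):
    # Min-cost one-to-one matching between reference and generated blocks.
    # Brute force over permutations, fine for a handful of blocks.
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

# Toy cost matrix: rows = reference blocks, columns = generated blocks.
# Entries could blend text distance, position offset, and color difference.
cost = np.array([
    [0.1, 0.9, 0.8],   # "About us"    vs ("About", "News", "Contact")
    [0.7, 0.2, 0.9],   # "Latest news"
    [0.8, 0.9, 0.3],   # "Contact"
])
print(match_blocks(cost))  # -> [0, 1, 2]: each block pairs with its lookalike
```

Once blocks are paired this way, the per-block color, position, and text scores can be computed on matched pairs, and unmatched blocks count against the block-match score.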

The figure below shows a radar chart comparing the performance of four different webpage-generating models: GPT-4V (Vision); Gemini Pro Vision; WebSight, which is a model trained on a synthetic dataset of webpage-code data; and Design2Code, which we’ll discuss in more detail shortly.

The authors then used the Design2Code benchmarking data and these metrics to evaluate how well these models generate webpages. GPT-4V consistently scores the best or close to the best across all the metrics. Even when the researchers tried various prompting techniques (e.g., direct, text-augmented, and self-revision prompting), GPT-4V came out on top. This was true even when people evaluated the generated webpages — they preferred GPT-4V over Gemini and other open-source models.

The quantitative aspects of Design2Code are really important for making incremental improvements in automatic webpage generation. Yet, on their own they left me feeling like these models aren’t quite up to the task yet. For example, even GPT-4V has a block-match score of only 78%! (Intuitively, this means something like this: The webpages made by GPT-4V either had only 78% of the blocks they should have, or that the image contained only 78% of the blocks that were in the created page; the 22% discrepancy is bad either way.) However, this is where the most intriguing aspect of the paper comes in: The researchers had people compare an original webpage to a webpage that was generated by GPT-4V. They then asked these people: Can the AI-generated webpage replace the original webpage? And is the reference webpage or AI generation better?

Amazingly, 49% of respondents considered the AI webpage to be interchangeable with the original, while 64% said they *preferred* the AI webpage to the original! I find this fascinating because, even though Design2Code provides really useful metrics for scoring webpage reproduction, none of them capture whether the generated webpage is functionally *good enough*. This is an aspect that would be great to see in more machine learning papers, especially ones that explore practical applications of AI. It’s a reminder that an AI model doesn’t need to score 100% on the relevant quantitative metrics for it to be useful. Sometimes, the bar can be much lower than that, and other times the metrics might not capture progress on fundamental questions like, Could this AI do a human’s job?

One final thought: As someone who shivers in fear when I hear the words “HTML” or “CSS,” I was really hoping that Design2Code would mean I’d never have to write a line of webpage code again in my life. Unfortunately, the reality is that webpage generation remains a challenging task for AIs. But one takeaway from the paper is that webpages become much harder to generate (according to the 5 axes outlined above) as the total number of tags increases. Unsurprisingly, this means that simpler webpages are easier to reproduce than more complex ones. So, if you have a simple design in mind and you just can’t be bothered to code it up, there’s a decent chance that an AI might be able to do it for you. However, if your design is complicated, then it might be best to leave it to the pros!

By now, you've probably heard of diffusion models. They’re those neural networks that transform an array of useless values, like random noise, into an array of meaningful data. Diffusion models are most famous for generating images and videos that — you guessed it — are arrays of useful data. You know what else are just arrays of useful data? Neural network parameters! So, can diffusion models be used to turn noise into useful network parameters? That’s the question that Wang et al. try to answer in their recent paper: Neural Network Diffusion.

How is it even possible to diffuse the parameters of a neural network (NN)? Architecturally, the setup is pretty much identical to how diffusion models generate images or video. The authors (Wang et al.) call their method *p-diff* (for “parameter diffusion”). The setup, shown in the figure below, involves a parameter autoencoder (upper left) and a latent diffusion model, a.k.a. LDM (upper right). Once the autoencoder is trained (more on this later), its decoder can be used to generate network parameters from the diffused latent representations (lower half).

Instead of training on web-scale datasets of images and video, the researchers compiled their own datasets of neural network parameters. Each dataset consists of very minor variations to a subset of a single NN’s parameters. To acquire these parameter variations, the researchers trained a model from scratch, and then — in the last epoch of training — froze the non-subset weights and continued to train the subset of weights that the diffusion model will be able to generate. Checkpoints of the subset weights slowly changed during this final period of training, and these are what the researchers used to train the LDM and autoencoder.

Training then proceeded as normal: First, the autoencoder was trained to encode latent representations of the input parameters, and the decoder had to decode them to recreate the input parameters by minimizing the mean-squared reconstruction error. Then, the LDM was trained to remove noise from noisy latent representations of the input parameters from the dataset. This all means that this trained autoencoder and LDM can only diffuse parameters for a single model, not several different models.

You might be wondering, “Are the different variations of parameter subsets really diverse enough to train p-diff to generate distinct representations? Essentially, is p-diff just memorizing the input data?” One way to check is to compare how well models with diffused parameters perform on tests compared to regularly trained models. Unsurprisingly, models with diffused parameters perform better when there are more parameter variants in the training set. But, given enough training data, models with diffused parameters perform about on par with their regularly trained counterparts, not reliably better.

That doesn’t really answer our question though. Maybe p-diff has just learned to copy the parameters of the model that performs best in its training data. The authors wondered this too, so they devised a similarity metric to compare two different models. The method compares the agreement between the two models’ predictions: It’s 100% when they precisely agree on the classification for examples in a test dataset, and 0% when they disagree on every example (and in between for partial agreement).
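As a quick sketch (my own illustration; the paper may define the metric slightly differently), this agreement-based similarity is just the fraction of test examples on which two models' predicted labels match:

```python
def prediction_similarity(preds_a, preds_b):
    """Fraction of test examples on which two models' predicted
    labels agree: 1.0 for identical predictions, 0.0 for total
    disagreement, and in between for partial agreement."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must be the same length")
    agreements = sum(a == b for a, b in zip(preds_a, preds_b))
    return agreements / len(preds_a)

# e.g., two models agreeing on 3 of 4 test examples:
# prediction_similarity([0, 1, 2, 1], [0, 1, 0, 1]) -> 0.75
```

A low score between two p-diff models would suggest genuinely distinct behavior rather than memorized copies of one good checkpoint.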

The figure below compares original models with p-diff models. In the bar chart, the mean similarity among original models (yellow) is much higher than the mean similarity among p-diff models (orange); yet the similarity among pairs of original-to-p-diff models is a little bit higher than p-diff models alone (pink). This indicates that p-diff models are more distinct from each other than from original models, but not by much.

The upper scatter plot compares the accuracy of individual models from the bar chart (y-axis) with their similarity to a single baseline model (x-axis). Original models (blue) all achieve similar accuracy and are very similar to each other, while p-diff models vary in performance, sometimes doing much better or worse. Finally, the lower scatter plot shows a t-SNE plot of the models’ parameters. Since the p-diff points and original points form their own clusters, the takeaway is that there’s something distinct about each set of parameters (since t-SNE can distinguish between them), but it’s not entirely clear *what*, as t-SNE is a bit of a black-box.

Personally, I don’t see methods like p-diff being used for practical applications any time soon — NNs are already black-boxes, and applying a second NN that changes the original’s parameters in a non-obvious way seems a bit dubious. But, as a research exercise, p-diff is absolutely fascinating! I’ve always imagined a NN’s parameters and its inputs as being tightly yet delicately coupled, such that, aside from further training, changes to some or all of the parameters would throw the whole NN out of whack. But with p-diff, that isn’t the case at all. I’d love to see future research delve deeper into the differences between p-diff models and vanilla ones — that would help me understand what p-diff learns from the parameter subsets in its training data.

[Links to papers at the bottom]

According to Google’s own technical report, their new Gemini model, Gemini 1.5 Pro, is an impressive beast. It boasts multimodality, super-efficient inference, and class-leading language modeling abilities. But its most impressive feature (to me) is its absolutely bonkers context length of up to 10M tokens! That’s ten times the entirety of *War and Peace*, a 1,440-page, 587,287-word book. Unfortunately, the report doesn’t disclose exactly how the Gemini team achieved this impressive feat. No doubt they’re concealing several cleverly engineered technical solutions, but — and this is entirely speculation on my part — perhaps some of those solutions build on a series of techniques developed by Hao Liu, a PhD student at UC Berkeley.

Last August, Liu published a new algorithm (paper [1]) for computing the attention and feedforward steps in Transformers using *blocks*. A single block is just a subset of either query vectors or key-value vector pairs. Instead of computing the interactions between all query, key, and value vectors at once and using the result in the subsequent feedforward projection, the data is chunked into two groups of blocks: one group of blocks for the queries and another for the keys and values. Then, the outputs of the attention and feedforward steps are computed on a block-by-block basis, as shown here:

Liu calls this method a Blockwise Parallel Transformer (BPT), since the blockwise computations can all be computed in parallel. But the main benefit of BPT isn’t faster computation via parallelism — it’s that the memory consumption of each blockwise computation is significantly smaller than a non-BPT equivalent. Normally, if an LLM accepts an input sequence of length *s*, then the result of multiplying all query vectors in matrix Q by all key vectors in matrix K will give us a large s×s matrix. Traditionally, that entire matrix is stored in memory at once because each row must be normalized via softmax so that we can find the weights used by the attention mechanism. The next image shows one aspect of BPT: how it uses a smaller query-key product to reduce memory needs.
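To make the idea concrete, here’s a toy, pure-Python sketch (my own illustration, not the paper’s code) of the core trick for a single query: the key-value pairs are processed one block at a time while running softmax statistics are maintained, so the full row of the *s*×*s* score matrix never has to exist in memory, yet the result matches the all-at-once computation exactly:

```python
import math

def attention_row(q, keys, values):
    """Softmax-attention output for one query, computed all at once:
    the entire score row (one row of the s x s matrix) is held in memory."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_row_blockwise(q, keys, values, block=2):
    """Same result, but only one key-value block's scores are ever
    alive at a time, thanks to running (online) softmax statistics."""
    dim = len(values[0])
    out = [0.0] * dim   # running, unnormalized output
    m = -math.inf       # running max score (for numerical stability)
    denom = 0.0         # running softmax normalizer
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)       # rescale old accumulators
        out = [o * scale for o in out]
        denom *= scale
        for s, v in zip(scores, vb):
            w = math.exp(s - m_new)
            denom += w
            out = [o + w * vd for o, vd in zip(out, v)]
        m = m_new
    return [o / denom for o in out]
```

The same rescaling trick underlies Flash Attention too; BPT’s contribution is applying blockwise computation to the feedforward step as well, and organizing everything around query blocks.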

With that one idea, blockwise computation already cuts memory usage down from O(*s*²). But BPT does more: It also computes, in blocks, the output of the feedforward network (FFN) that follows the attention module. Normally the hidden layer of the FFN, which is larger than its input, uses 4x as much memory as its output. Holding onto that data would be the next memory bottleneck (after the *s*×*s* Q-K matrix product). BPT reduces memory usage here, too, by only requiring us to hold a single (query-based) block in memory at a time. In other words, the FFN is only ever computed on a small subsequence at a time, requiring less memory. This makes the memory bottleneck smaller still, which is how BPT can handle sequences up to 4x longer than even *Flash Attention*, another method of reducing attention’s memory needs.
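The FFN half of the idea is even simpler, because the FFN is applied position-by-position: processing one block of positions at a time gives bit-identical output while bounding the size of the hidden activations that are alive at any moment. A minimal sketch (my own, with made-up tiny weights):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ffn_full(X, W1, W2):
    """Two-layer FFN over the whole sequence X at once: conceptually,
    all s x hidden activations coexist in memory."""
    return [matvec(W2, relu(matvec(W1, x))) for x in X]

def ffn_blockwise(X, W1, W2, block=2):
    """Identical output, but only one block's hidden activations
    (block x hidden, instead of s x hidden) are alive at a time."""
    out = []
    for start in range(0, len(X), block):
        out.extend(ffn_full(X[start:start + block], W1, W2))
    return out
```

Because each position’s FFN output depends only on that position, the block boundaries have no effect on the result at all.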

Despite how efficiently BPT uses memory, its memory requirement still scales linearly with the input sequence length. So even though it’s quite efficient, it would still have an impressive but limited context length. BPT alone couldn’t possibly account for Gemini 1.5 Pro’s gargantuan context length.

Just three months later, Liu followed up the BPT paper with another BPT-based paper. By extending the BPT approach with something called *Ring Attention* (paper [2]), the method’s memory requirements scale linearly with the size of the query, key, and value blocks, rather than the length of the input sequence. The idea behind Ring Attention is that the BPT method’s parallel block computations can be distributed among several devices (think GPUs or TPUs), where each device computes one element of the outer loop from the figure above.

The main problem with distributing BPT across devices this way is that, even though each device needs only its own query block, every device still needs access to *all* the key-value blocks; holding every key-value block on a single device would defeat the memory savings. To overcome this challenge, the devices are arranged into a ring topology: each device can share data with a single *previous* device and a single *following* device, connected in a ring.

BPT’s fused attention-and-feedforward computation then proceeds as follows: First, each device holds the data for one of the query blocks in the outer loop, while key-value blocks are rotated between devices in a ring. This is really efficient since, while a device is performing its blockwise attention, it can simultaneously send the key-value block it’s currently holding to the following device and receive a new key-value block from the previous device.
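A toy sketch of the rotation schedule (my own, and heavily simplified — real implementations overlap this communication with computation): device *i* starts with key-value block *i*, and at every step each device passes its current block to the following device and receives one from the previous device:

```python
def ring_schedule(n_devices):
    """held[step][dev] = index of the key-value block that device
    `dev` holds at `step`. Device dev starts with block dev; each
    step, blocks shift one position around the ring."""
    return [[(dev - step) % n_devices for dev in range(n_devices)]
            for step in range(n_devices)]

# With 4 devices: after 4 steps, every device has seen every
# key-value block exactly once, while only ever holding one block.
```

Each device accumulates its query block’s attention output across these steps using the same running-softmax trick as in the blockwise computation, so holding one key-value block at a time is enough.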

Ring Attention is depicted in the figure below. The solid boxes represent computations (solid outlines) and data (dashed outlines) that are stored on one of the hosts, while the faded boxes represent data for other devices. As you can see, the device depicted is responsible for all the attention and feedforward computations for a single query block, while the keys and values traverse the devices in the ring.

The reason why Ring Attention computes the same result as a vanilla Transformer is actually really subtle. It’s not obvious that blockwise self-attention and feedforward computations are equivalent to their vanilla counterparts — but they are! That’s because the final result of the blockwise computations is invariant to the order in which they were computed. It’s similar to how 4+5 gives the same result as 5+4: It doesn’t matter which “block” comes first.
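You can see this order-invariance in miniature with a softmax-weighted average, which is what attention computes per query. In this toy demo (my own, with single-element “blocks” for simplicity), every possible processing order of the blocks yields the same answer, because the accumulated numerator and denominator are just sums:

```python
import itertools
import math

scores = [0.3, -1.2, 2.0, 0.5]   # one query's attention scores
values = [1.0, 2.0, 3.0, 4.0]    # scalar values, for simplicity

def softmax_average(order):
    """Accumulate the softmax-weighted average element by element,
    in the given processing order."""
    num = den = 0.0
    for i in order:
        w = math.exp(scores[i])
        num += w * values[i]
        den += w
    return num / den

# All 24 orderings agree (up to floating-point rounding), which is
# why blockwise attention matches the vanilla computation.
results = [softmax_average(p) for p in itertools.permutations(range(4))]
```

Sums are commutative and associative, so it doesn’t matter which device’s block arrives first — exactly the 4+5 = 5+4 point above.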

Ring Attention takes BPT’s memory efficiency to a whole new level, achieving context lengths 2–3 orders of magnitude longer than the competition. Intuitively, you could say that Ring Attention can work with a context window proportional to the number of devices in the ring. The figure below shows the maximum context size for various Transformer architectures on TPUv4-1024 hardware (32GB of memory per chip). The green bar represents the maximum context length for a vanilla Transformer, the yellow for a Transformer with Flash Attention, lavender for BPT, and the red for BPT with Ring Attention.

As if that wasn’t enough, Liu and coauthor Wilson Yan wrote a new paper that extends BPT with Ring Attention even further (paper [3]). In their paper, they present some ways to train models that understand text *and* video sequences. They also present solutions to various technical challenges that arise when doing such training, such as what to do when text and video sequences have different lengths or how to apply different weights to language and vision inputs. This article is getting quite long, but I encourage you to give their paper a read if you want to learn more.

Gemini 1.5 Pro is an undeniably impressive feat of engineering, requiring vast amounts of resources. But it’s important to remember the human element behind advancements like Gemini. I don’t know of an official link between Hao Liu and Gemini — maybe they're not really linked — but we can still appreciate the significance of Liu’s contributions in achieving a new era in LLM context lengths. Individual researchers like Liu can make significant contributions, even seemingly single-handedly developing groundbreaking algorithms that propel entire fields forward. Even the most opaque and large-scale technological achievements are ultimately driven by personal ingenuity.

—

Hi, it’s still me, Adrian. I hope you enjoyed today’s summary as much as I enjoyed writing it. Actually, I lie — mostly I enjoyed reading Hao Liu’s papers. As someone who just submitted their PhD thesis, let me say that what Liu has achieved over the span of their PhD so far is nothing short of incredible. If you share this view, you should definitely give these papers a thorough read:

[1] Blockwise Parallel Transformer for Large Context Models (This is the original BPT paper).

[2] Ring Attention with Blockwise Transformers for Near-Infinite Context (This is the one that extends BPT with Ring Attention)

[3] World Model on Million-Length Video and Language with Blockwise Ring Attention (This is the one that does the crazy multimodal text-video training).

If you’re a regular ChatGPT user, you’ve probably noticed that, capable as ChatGPT may be, it struggles with some tasks. For example, if you asked it to compose a rap song for you, it might spit out some impressive rhymes in a typical verse-chorus song structure, but not an actual *song*. Wouldn’t it be cool if LLMs like ChatGPT could generate songs in a more complete sense, with information about tempo, chords, melodies, structure, and motifs? A new model called ChatMusician can do just that, as the open-source research community Multimodal Art Projection (M-A-P) announced in this week’s paper.

When I think of audio-based neural networks (NN), I immediately think of bespoke NN architectures that are tailored to address the specific challenges of audio signals. But ChatMusician doesn’t use a bespoke architecture; it’s just a fine-tuned version of Llama 2. Despite its standard design, ChatMusician can learn about music by leveraging a music-specific format called ABC notation. This format is a text-based, shorthand way to write a piece of music. The figure below (from Wikipedia) shows ABC notation and the corresponding staff notation for the same song.
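For a flavor of the format, here’s a tiny illustrative fragment of ABC notation (my own toy example, not one from the paper): a header giving the tune’s index, title, meter, default note length, and key, followed by a single repeated bar-pair of melody.

```
X:1
T:Toy Example
M:4/4
L:1/8
K:G
|: G2 A2 B2 c2 | d4 g4 :|
```

Because it’s all plain text, an LLM can read and write ABC with ordinary next-token prediction — no audio-specific architecture required.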

To teach ChatMusician about music, the M-A-P researchers curated their own dataset to continue training Llama 2. They included the following categories of text in their training corpus:

General music corpora, which are text documents containing music terminology.

Instruction and chat data to help Llama 2 learn how to chat and answer questions.

Music knowledge and music summaries, which are summaries of the metadata from 2 million songs from YouTube, and music knowledge QA pairs that are generated from these summaries using LLMs.

Math and code data, which the researchers think will aid ChatMusician with symbolic reasoning of music scores.

In addition, the researchers also curated a music theory dataset called MusicTheoryBench. To do that, they hired a college-level music teacher to create 372 multiple-choice questions (each with 4 choices) about music knowledge and reasoning. (If a question included music notation, the authors converted it into ABC notation.) The figure below shows examples of these knowledge (top) and reasoning (bottom) questions.

The researchers used these questions to evaluate several ChatMusician variants’ musical understanding against GPT3.5, GPT4, and Llama 2. I think there are two main takeaways from their results, which you can see below: First, GPT4 knows a lot about music, but ChatMusician’s music-specific training makes it more effective at musical reasoning than its base model, Llama 2. Second, music *reasoning* is hard: All the models score about as well as random guessing, though the ChatMusician models perform marginally better than that.

Beyond quantitative results, the researchers also demonstrate how ChatMusician can generate music. The figure below shows the ABC notation (top) and the corresponding staff notation (bottom) of a song created by ChatMusician. We can see some of the key features of ABC notation, such as the “|:” and “:|” repetition symbols (blue), as well as repetition and motifs in the colored sections. (Red blocks are one motif, yellow are another, and green blocks represent variation on the preceding motif. I think these colors represent analysis by a human of the model-created song.)

The researchers conducted several other experiments, such as analyzing the compression ratio of ABC notation relative to other music data formats (e.g., MIDI and WAV), exploring few-shot learning with GPT4 on music knowledge and reasoning, and a study where participants heard model-generated music head-to-head against an actual song. In that last experiment, people preferred ChatMusician’s music over the actual song 76% of the time, versus GPT4’s 44%.

Despite the impressive results of their research, I think this represents just the tip of the iceberg for musical LLMs. The approach they used to craft MusicTheoryBench could be extended to create more training data to make an even better musical LLM. These models could open up a new kind of laboratory for musical experimentation. Perhaps one day musicians, chatting with future music-generation models, will have powerful new tools at their disposal.

Years ago when I graduated from university, my first job was to design parts of logic chips that implement data-compression algorithms. This meant that I needed to understand the compression algorithms themselves *and* needed to modify them so that I could minimize the number of logic gates they used. (If you’re familiar with FPGAs, I was trying to minimize lookup table usage.)

One of these compression algorithms was based on a small neural network (NN). Using what was cutting-edge research at the time, I figured out how to replace the NN’s floating-point arithmetic with fixed-point, integer-based arithmetic. This was a boon for minimizing logic usage, since integer-based arithmetic is far simpler to implement than floating-point arithmetic. While I was quite impressed with my achievements at the time, these kinds of NN optimizations have come a *long* way since then. Today’s paper presents one such example: BitNet b1.58. (I’ll explain where this magic number 1.58 comes from below.)

BitNet b1.58 is a peculiar NN proposed by GeneralAI, a Microsoft-backed research lab based in Beijing. It builds upon an LLM called BitNet, previous work from the same lab that aims to make NN computations more efficient by reducing the number of bits used in each operation. The bit-reduction techniques fall on a spectrum:

On one end, vanilla NNs typically use 32-bit floating-point (or fp32) values, which is the standard data format for doing math with real numbers.

With some loss in precision, fp16 can be used instead of fp32, which yields some computation and memory benefits.

Fixed-point based methods typically use 8-bit integers (int8). This approach can often be substantially more efficient than fp16 (especially for hardware optimized for integer arithmetic), but with substantial added complexity and loss in precision.

Finally, BitNet takes bit-reduction to the extreme. It uses 1-bit weights — that is, the weights are either +1 or –1.

Before I describe BitNet b1.58, it’s worth understanding BitNet a bit more. A NN with 1-bit weights requires completely rethinking how a NN functions, down to the individual math operations it uses. For starters, BitNet only uses 1-bit weights for the *fully-connected* layers in its transformer architecture, but these layers account for the vast majority of the LLM’s computational requirements. BitNet’s replacement for a fully-connected component is called a *BitLinear* module. BitLinear’s activations and intermediate results are stored at a higher, 8-bit precision. The image below shows the layer, with *β* and *γ* being additional values that BitLinear uses to dequantize the accumulated result (which I’ll discuss next) into the 8-bit range.

NNs are mostly comprised of matrix-multiplication operations, which can be implemented using several multiply-accumulate steps: Multiply a weight and an input value, accumulate the result, and repeat. The genius behind using weights that are either +1 or –1 is that the multiplication step can be simplified to an addition (or subtraction) instead: Given the value of the weight, add (or subtract) the input from the accumulator.

This is where BitNet b1.58 comes in. The “b1.58” part of its name is derived from how many values the weights can take: Instead of just +1 and –1, BitNet b1.58 adds a third value, 0. So, instead of using 1-bit weights like BitNet, it uses log2(3) ≈ 1.58-bit weights. (The math works like this: If we could have ~1.58 bits, then those bits could store 2^1.58 ≈ 3 different values.) BitNet’s accumulate-only property is also retained: Instead of only adding (or subtracting), there’s now a third option: “do nothing with this input.” One downside of the BitNet approach is that the LLM must be trained from scratch: An existing model can’t simply be converted to a 1-bit or 1.58-bit NN, whereas fp32 models can be converted to fp16 or even int8 without full retraining.
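Here’s a minimal sketch (my own illustration, not the paper’s kernel) of what a multiplication-free matrix-vector product looks like with ternary weights: every “multiply” collapses to an add, a subtract, or a skip.

```python
def ternary_matvec(W, x):
    """Matrix-vector product where every weight is -1, 0, or +1:
    no multiplications needed, only adds, subtracts, and skips."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # weight +1: add the input
            elif w == -1:
                acc -= xi      # weight -1: subtract the input
            # weight 0: skip this input entirely
        out.append(acc)
    return out

# ternary_matvec([[1, -1, 0], [0, 1, 1]], [2.0, 3.0, 4.0]) -> [-1.0, 7.0]
```

In hardware terms this is the whole appeal: adders are far cheaper than multipliers, and the 0 weights prune work away entirely.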

By this point, you might be asking yourself, “If 1-bit weights were the whole point of BitNet, what does adding a third weight value cost — and gain — in computational efficiency?” To answer this question, I would really love to show you some experimental results comparing BitNet to BitNet b1.58, but unfortunately the paper doesn’t include such a comparison. Still, I think we can intuit the following:

The size of the weights in memory will at least double, since two bits are required to store a ternary value without compression (i.e., you can’t actually address 1.58 bits).

The additional algorithmic complexity induced by the 0 weight should have negligible effect on runtime compared to the original BitNet.

The language-modeling accuracy should improve compared to BitNet, since the weights can be more precise.

The results shown in the paper compare BitNet b1.58 to Llama (the original, not Llama 2). Across seven different tasks and three different model sizes (700M, 1.3B, and 3B), BitNet b1.58 performs within ~1% of Llama. Also, the figure below shows the latency and memory required for Llama and BitNet b1.58, and highlights how much more efficient the new approach is on an equal-parameter basis. Interestingly, the authors also report efficiency improvements for 13B- and 70B-sized models, but they don’t discuss the comparative accuracy of these model variants.

The final, and perhaps most important part of optimized NNs is the hardware that they run on. The creators of BitNet b1.58 made their comparisons using GPUs, so I think we can assume that BitNet models run more efficiently on GPUs than on CPUs. I’d say this provides a fair comparison, since specialized hardware for running 1-bit NNs doesn’t exist yet. But the authors note that hardware specifically designed for 1-bit NNs would yield significantly better performance.

Ultimately, the way we run NNs is up to chip designers. Anecdotally, I’d say there has been a convergence on fp16 (and related 16-bit floating point formats) for NN computations, especially in GPUs and Google’s TPUs. But, regardless of whether these hypothetical efficiency gains are realized with specialized hardware, the BitNet b1.58 approach could be a very effective way to run LLMs on consumer devices that don’t have a lot of computing power, like phones and smartwatches.