A breakthrough in detecting LLM-made text
Paper: Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text
Summary by Adrian Wilkins-Caruana and Tyler Neylon
People can no longer confidently say whether something they’re reading was written by a human or an AI, so researchers are trying to get computers to tell us instead. AI text detection is actually quite a hot topic at the moment. We’ve even discussed two such papers on Learn and Burn (here and here). But unlike those papers, which take a preemptive approach to detection by hiding invisible watermarks in the generated text, today’s paper takes a more post-hoc approach — that is, determining whether text is AI-generated without the need for a watermark. Abhimanyu Hans and coauthors call their detection method Binoculars because it looks at the input text through the lenses of two different LLMs.
To understand how the Binoculars model works, we first need to understand something called perplexity, which plays a central role in the method. Perplexity is a value that describes how surprising some text is to a particular LLM. The higher the value, the more surprising it is. For example, “1, 2, 3, 4, 5, 6” would have a lower perplexity than “1, 2, 3, gobbledygook.” The formula below shows how to calculate perplexity for a sequence of tokens, where the Y_i[x_i] term means p_θ(x_i|x_<i), which itself means: Given the tokens seen so far, such as “1, 2, 3,” how likely is the next token, such as “4” or “gobble”?
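Here's a rough reconstruction of that formula from the description above (the paper's exact notation may differ a bit), where L is the number of tokens in X:

$$\log \mathrm{PPL}(X) = -\frac{1}{L} \sum_{i=1}^{L} \log\big(Y_i[x_i]\big), \qquad Y_i[x_i] = p_\theta(x_i \mid x_{<i})$$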
Notice that if the Y values were close to 1 (corresponding to predictable, unsurprising text), then the expression on the right would be close to 0. If, however, the Y values were all close to 0 (for surprising text, since the probability of the next token is small), then the expression on the right would have a high value. That is:
Low perplexity value = normal & predictable text,
High perplexity = surprising & unlikely text.
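For a concrete feel with made-up numbers (and natural logs): if a model assigns probability 0.9 to each of three observed tokens, the value is $-\tfrac{1}{3}(3\log 0.9) \approx 0.11$; if it assigns 0.01 to each, it's $-\tfrac{1}{3}(3\log 0.01) \approx 4.6$. Confident, correct predictions keep the value low; badly surprised ones push it up.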
The reason perplexity is so useful for detecting AI-generated text is that log(PPL(X)) is the loss function used to train LLMs. So, of course an LLM will score its own output as having low perplexity, because it tends to generate tokens that it finds the least “surprising.” And, to foreshadow a little bit, the outputs from one LLM should also have low perplexity as judged by a different LLM: even though the two models might have slightly different training data and regimens, LLM outputs still tend to be less surprising than human-written text.
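To make that connection concrete, here's a tiny PyTorch sketch (mine, not from the paper) showing that the average next-token cross-entropy loss used in training is exactly the log-perplexity above. The logits and targets are random stand-ins for a real model's outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
L, V = 6, 50                          # made-up sequence length and vocabulary size
logits = torch.randn(L, V)            # stand-in for an LLM's next-token logits
targets = torch.randint(0, V, (L,))   # stand-in for the tokens that actually appear

# The training loss: mean cross-entropy over positions.
loss = F.cross_entropy(logits, targets)

# log PPL: -(1/L) * sum_i log p(x_i | x_<i), computed by hand.
log_probs = F.log_softmax(logits, dim=-1)
log_ppl = -log_probs[torch.arange(L), targets].mean()

print(loss.item(), log_ppl.item())    # identical up to floating-point error
```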
Now, you might be asking yourself, “If perplexity is so useful, why does the Binoculars model need two LLMs?” Great question! The issue is that perplexity might yield surprising results when you consider LLM prompts. For example, let’s say I prompt an LLM to “write a few sentences about a capybara that is an astrophysicist,” and we get a story about Nicolaus Capybarnicus. If we measure the perplexity on only the generated text and not the prompt, it will probably have a high perplexity, since capybaran astrophysicists are not very common (so far). But the text was still AI generated. So high perplexity doesn’t always mean that text is human generated.
This is where Binoculars comes in. Instead of measuring just perplexity with respect to one LLM, it also measures something called cross-perplexity, which is given by the formula below. Here M1 and M2 are two different LLMs, and s is a string, meaning the text we’re examining. The M1(s)_i and M2(s)_i terms are the probability vectors the two models assign to the next token at position i, given all the words before it (x_<i); the entry of M1(s)_i at the token that actually appears, x_i, is the same p_θ(x_i|x_<i) = Y_i[x_i] from above. The “·” is a dot product, taken between M1’s probability vector and the (elementwise) log of M2’s probability vector. Intuitively, this expression (it’s essentially an averaged cross-entropy) gets larger when the two probability distributions (from models M1 and M2) are farther apart.
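Reconstructing the formula from that description (again, the paper's notation may differ slightly), with log applied elementwise:

$$\log \text{X-PPL}_{M_1,M_2}(s) = -\frac{1}{L} \sum_{i=1}^{L} M_1(s)_i \cdot \log\big(M_2(s)_i\big)$$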
Finally, Binoculars combines the normal perplexity and the cross-perplexity into a score called B, shown in the formula below. The numerator is just the perplexity of M1, which measures how surprising the text is to M1. The denominator is the cross-perplexity, which measures the distance between what M1 would generate (as next-word completions of all prefixes of s) and what M2 would generate.
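Reconstructed, the score is the ratio of those two quantities:

$$B_{M_1,M_2}(s) = \frac{\log \mathrm{PPL}_{M_1}(s)}{\log \text{X-PPL}_{M_1,M_2}(s)}$$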
I think of the numerator as “text weirdness”: It’s a low value for normal text, and a high value for weird text. But remember that even LLM-made text can be surprising (have a high numerator), as in the Nicolaus Capybarnicus case. Intuitively, I’ll call this situation “prompt weirdness,” and you can think of the denominator as measuring that prompt weirdness. So the score, intuitively, is:
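$$B \approx \frac{\text{text weirdness}}{\text{prompt weirdness}}$$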
Note that the score shouldn’t be interpreted as a prediction or probability, since it’s possible for the score to be greater than 1. In practice, M1 and M2 could be any pair of LLMs, but it’s best if they’re as similar as possible, because the prompt weirdness (the cross-perplexity) is useful when the distance between M1 and M2 is smaller than the distance between the LLMs and human-made text. (But the denominator wouldn’t contain useful information if you tried to use the same LLM twice; hence the need for two different LLMs.) The authors used Falcon-7B and Falcon-7B-Instruct as M1 and M2, respectively.
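For concreteness, here's a rough sketch of how the whole score could be computed with two Hugging Face models. This is my own code, not the authors' implementation; it assumes the two models share a tokenizer (as Falcon-7B and Falcon-7B-Instruct do) and ignores practical details like device placement, precision, and batching:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
m1 = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")           # M1
m2 = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")  # M2

@torch.no_grad()
def binoculars_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids   # shape (1, L)
    logits1 = m1(ids).logits[0, :-1]                       # M1's next-token logits, (L-1, V)
    logits2 = m2(ids).logits[0, :-1]                       # M2's next-token logits, (L-1, V)
    targets = ids[0, 1:]                                   # the tokens that actually came next

    # Numerator: log PPL of M1 -- average surprise of the observed tokens under M1.
    log_ppl = F.cross_entropy(logits1, targets)

    # Denominator: log X-PPL -- M1's probabilities dotted with M2's log-probabilities,
    # averaged over positions.
    p1 = F.softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    log_xppl = -(p1 * logp2).sum(dim=-1).mean()

    return (log_ppl / log_xppl).item()

# Per the intuition above, lower scores point toward machine-generated text;
# the cutoff itself is a tunable threshold.
print(binoculars_score("1, 2, 3, 4, 5, 6"))
```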
To evaluate how well Binoculars can detect AI-generated text, the authors focused on measuring the true-positive rate when the false-positive rate is very low (0.01%). They can do this because there’s a tunable decision threshold that determines whether a given Binoculars score B gets labeled as machine-generated. They made this choice because they argue that flagging human-written text as machine-generated (a false positive) is the most harmful outcome. The figure below shows that, in this setting, Binoculars roughly matches or outperforms the competition at detecting ChatGPT, even though the other approaches were specifically trained to detect ChatGPT, whereas Binoculars just uses regular LLMs. The authors also ran other tests to show that Binoculars is capable of detecting other LLMs, like LLaMA and Falcon, too.
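As a toy illustration of that threshold-tuning recipe (the score distributions below are completely made up; only the procedure matters): pick the cutoff so that at most 0.01% of known human-written samples fall below it, then check how much machine-generated text that cutoff still catches.

```python
import numpy as np

rng = np.random.default_rng(0)
human_scores = rng.normal(1.0, 0.1, 100_000)     # hypothetical scores for human-written text
machine_scores = rng.normal(0.75, 0.1, 100_000)  # hypothetical scores for machine-generated text

threshold = np.quantile(human_scores, 0.0001)    # ~0.01% of human text falls below this cutoff
tpr = (machine_scores < threshold).mean()        # lower score => flagged as machine-generated
print(f"threshold = {threshold:.3f}, TPR at 0.01% FPR = {tpr:.1%}")
```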
All in all, the Binoculars method seems to be a really neat solution to this detection problem. In the interest of not making this summary too long, I had to leave out some of the other interesting experiments that the authors conducted, like how grammatical errors don’t affect the score, or what the score looks like for purely random text. Also (and this is perhaps a suggestion to the authors for their next paper), I’d like to see an experiment that examines how robust the Binoculars score is to an adversary. For example, if I were trying to break the Binoculars score, I’d instruct my LLM to choose a really surprising token every now and then, or to choose mildly surprising tokens all the time. Sure, this would hurt the output quality a bit, but, much like a human’s writing, the resulting text would be more surprising and less machine-like in a way that’s hard to predict (flibbertigibbet!). I can’t wait to read that next paper, Hans et al.!
—
Hi, it’s still me, Adrian. If this paper was interesting to you, I really encourage you to sit down and read it in detail. The paper is great, just like all the papers we discuss on Learn and Burn, but more importantly, this particular paper emphasizes the researchers’ high-level motivations, which I think should be more commonplace in research papers. It gives me a sense that the authors aren’t just trying to report results on experiments that score better on relevant metrics, but that they’re actually trying to solve real problems and they’re thinking critically about the limitations of the research landscape as a whole. Give it a read — I promise you won’t regret it. 🙂