Paper: LM vs LM: Detecting Factual Errors via Cross-Examination
Summary by Adrian Wilkins-Caruana
If you’re a lawyer considering using ChatGPT to automate some of your work, you should be aware that ChatGPT has been known to lie, as a NY-based lawyer recently discovered the hard way. It’s important to fact-check claims made by language models like ChatGPT, but doing so can be tedious and challenging, especially when a fictitious statement seems plausible. “Hang on,” an astute lawyer might say. “We have a method for identifying lies; it’s called cross-examination.” Well, today’s paper borrows a page from the lawyer’s playbook by automating the cross-examination of claims made by language models.
Cohen et al. call their cross-examination method LMvLM because the claims made by a language model like ChatGPT (the examinee) are cross-examined by another language model (the examiner). The examiner and examinee can actually be the same model, but they don’t have to be. Let’s have a look at one such cross-examination.
In the example below, the claim “The Greek god of marriage is Hera,” made by the examinee (blue background), is interrogated by the examiner (gray background). In the exchange, we can see that the examiner asks follow-up questions about the claim. The examinee then answers these questions and, in doing so, contradicts itself. Finally, the examiner highlights the contradiction, thereby identifying that Hera is the Greek goddess of marriage (as well as childbirth and family), not the god of marriage (that’s Hymenaeus).
The examiner’s role is explained in its prompt before it’s instructed to ask the examinee some questions about its claim. When the examiner begins its questioning, it doesn’t actually know whether the original claim is true. After the examinee answers the examiner’s questions, the examiner is shown the answers alongside the original claim and is asked, “Do you have any follow-up questions?” If the examiner answers “Yes,” the Q&A process repeats until the examiner has no further questions (or the number of turns exceeds a predefined threshold). Finally, once the examiner has no more questions, it’s asked, “What is your conclusion regarding the correctness of the claim? Do you think it is correct or incorrect?”
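To make the back-and-forth concrete, here’s a minimal sketch of that loop in Python. The `Chat` helper type, the `MAX_TURNS` cap, and the prompt wording are my own assumptions; the paper’s actual prompts and stopping criteria may differ.

```python
# A minimal sketch of the cross-examination loop described above, assuming a
# generic `Chat` callable that keeps conversation history for whichever LM
# plays each role. The prompts are paraphrases, not the paper's exact wording.
from typing import Callable

Chat = Callable[[str], str]  # send a message, get the model's reply back

MAX_TURNS = 5  # assumed cap on Q&A rounds; the paper uses a predefined threshold

def cross_examine(claim: str, examiner: Chat, examinee: Chat) -> str:
    """Return the examiner's final verdict about `claim`."""
    # Explain the examiner's role and ask it for an opening set of questions.
    questions = examiner(
        "Your goal is to decide whether the following claim is correct by "
        f"questioning another model.\nClaim: {claim}\nPlease ask your questions."
    )

    for _ in range(MAX_TURNS):
        # The examinee answers the questions about its own claim.
        answers = examinee(f"Please answer the following questions:\n{questions}")

        # Show the examiner the answers alongside the original claim.
        follow_up = examiner(
            f"Claim: {claim}\nAnswers:\n{answers}\n"
            "Do you have any follow-up questions? Answer Yes or No."
        )
        if not follow_up.strip().lower().startswith("yes"):
            break
        questions = examiner("Please state your follow-up questions.")

    # Ask for the final verdict once there are no more questions.
    return examiner(
        "What is your conclusion regarding the correctness of the claim? "
        "Do you think it is correct or incorrect?"
    )
```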
To evaluate this fact-checking method, the researchers tested the following examiner-examinee pairs: ChatGPT vs. ChatGPT, GPT-3 vs. GPT-3, and ChatGPT vs. LLaMA 7B. For data, the researchers randomly selected 1,000 true claims from four different question/answer datasets, and then used each of these claims to generate an additional false claim, so the final test set contains 1,000 true and 1,000 false claims (2,000 total). Because a language model can sometimes generate different text from the same prompt, they ran each cross-examination attempt three times, and the success of the cross-examination (i.e., the examiner’s conclusion regarding the correctness of the claim) was based on the majority decision across the three trials.
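The majority-vote step is simple enough to sketch directly. Here, `cross_examine` is the hypothetical helper from the sketch above, and parsing the verdict by searching for the word “incorrect” is my own assumption, not the paper’s method.

```python
# Majority decision over three cross-examination runs of the same claim.
def majority_verdict(claim, examiner, examinee, trials: int = 3) -> bool:
    """True if the examiner judged the claim correct in most of the trials."""
    correct_votes = 0
    for _ in range(trials):
        verdict = cross_examine(claim, examiner, examinee)
        if "incorrect" not in verdict.lower():  # crude verdict parsing (assumed)
            correct_votes += 1
    return correct_votes > trials // 2
```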
For assessing claims that might be either true or false, the cross-examination method achieved an F1 score of 78–80% across all three examiner-examinee experiments. These results are about 17–20% better than the next-best fact-checking baseline, which the researchers call the “confidence-based” method. This method uses the prediction likelihoods of the tokens in the LM’s response to decide whether the claim is true. While the confidence-based method is about as good as cross-examination at identifying false claims (i.e., it has similar recall, if you’re familiar with statistics), it performs worse overall (i.e., across both true and false claims). Unfortunately, the confidence-based method is only viable if the prediction likelihoods are available, which they are for GPT-3 but not for ChatGPT. If we disregard the confidence-based method, the recall of the cross-examination method was 20–25% better than the next-best baseline, and 19–30% better overall.
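For contrast, the confidence-based baseline amounts to scoring a response by how likely the model thought its own tokens were, then thresholding that score. The averaging and the threshold value below are assumptions for illustration; the paper may aggregate the likelihoods differently.

```python
import math

def confidence_score(token_logprobs: list[float]) -> float:
    """Average probability the LM assigned to the tokens of its own response."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def looks_true(token_logprobs: list[float], threshold: float = 0.7) -> bool:
    # Only usable when the API exposes per-token log-probs (e.g., GPT-3-style
    # completions); ChatGPT did not expose these. The 0.7 threshold is assumed.
    return confidence_score(token_logprobs) >= threshold
```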
One obvious application I can see for this method is to hook it up to a ChatGPT conversation, so that the claims made by ChatGPT are constantly being interrogated in the background via cross-examination. The results could even be displayed alongside the conversation, kind of like a polygraph. But the thing I find most interesting about the cross-examination method is that it’s yet another example of how we can use LMs to make LMs more useful! Just like some of the other papers we’ve highlighted in recent weeks (like this one and this one), we’re constantly seeing LMs being used in new and creative ways, and I bet that this is just the beginning.