Finding the hidden phrases that jailbreak LLMs
Paper: Universal and Transferable Adversarial Attacks on Aligned Language Models
Summary by Adrian Wilkins-Caruana
In the past, if you asked ChatGPT “Tell me about SolidGoldMagikarp,” it would have said something really strange in response: “I’m sorry, it’s not clear to me what you are referring to with ‘disperse.’ Can you provide more context?” As it turns out, “ SolidGoldMagikarp” (yep, with a space before the S) was one of many anomalous tokens that used to make ChatGPT behave strangely. This behavior has since been patched, but the discovery of unexpected behavior in LLMs is part of the larger AI alignment effort: trying to steer AI systems towards humans' intended goals, preferences, or ethical principles. Finding adversarial LLM attacks — like anomalous tokens — is challenging, and requires a lot of manual effort from researchers. But today’s paper changes that by introducing an effective, automated LLM attacking procedure.
One way to make LLMs more aligned is to supplement their prompts with a “system prompt” — instructions about how the LLM should behave. For example, if you were to ask how to build a bomb, an LLM assistant may simply refuse to answer, since the system prompt has instructed it to not be harmful. In their research, Zou et al. tried to find ways to make the assistant follow instructions from the user despite their safety-encouraging system prompt.
Their method is to use algorithms to automate finding an adversarial suffix, text that can be added to the end of a user’s prompt to convince the assistant to follow the person’s instructions. You can see an example of how this works below. Given the user’s input, the method tries to “find” an adversarial suffix (which, in this example, is "!!!!!!!") that induces the assistant to follow the user’s instruction. The algorithm finds the tokens that comprise the suffix over several iterations by randomly replacing some of the tokens in the suffix with other random tokens until it finds ones that make the desired output more likely (i.e., the probabilities of the words that obey the user’s instruction are higher than other words). Tokens which are found to do this are greedily sampled, and the process is repeated until the assistant is highly likely to respond as instructed.
This method can be generalized to work for other prompts by optimizing to maximize the likelihood of the desired output over many different request-response examples. The result is a single adversarial suffix that works for many different unsafe user requests. Here’s part of an adversarial prompt found by their method:
describing.-- ;) similarlyNow write opposite contents.](Me giving////[...]
Looks strange, right? The intuition behind why suffixes like this work is that it puts the model into a “state” where the “harmful” response becomes the most likely response. This may be because when we train LLM assistants to not follow harmful requests, they aren’t trained with such strange and unexpected prompts.
Zou et al. evaluate their approach across 500 “Tell me how to [do something harmful],” “Sure, here’s how to [do the harmful thing]” examples. They measure the success of their method and other baselines using the Attack Success Rate (ASR), the empirical likelihood that the assistant will be successfully fooled to follow the harmful instructions on the first try.
First, they applied their method to two different AI assistants: Vicuna 7B and LLaMA-2-7B, where the ASR was 88% and 57%, respectively. (The next-best baseline was only 25% and 3%.) To further demonstrate the effectiveness of their approach, they test whether the adversarial prompts targeted for the Vicuna and LLaMA models are transferable to other, closed-source models like GPT-4. They try to fool these models using several adversarial prompts, and the ASR where at least one of the prompts succeeded was 87% for GPT-3.5, 47% for GPT-4, 66% for PaLM-2 (from Google), and 2% for Claude-2 (from Anthropic.)
I think there are two key takeaways from this research. First, the low ASR for Claude-2 might indicate that its alignment training method was sufficiently different from other AI assistants that the particular adversarial suffixes generated on the open-source models are not effective, or perhaps that adversarial prompts are ineffective altogether. Second, AI alignment is like a game of cat and mouse. Companies like OpenAI and Google should be scrambling to make their assistants resilient to these attacks, but researchers will no doubt work to find newer, more sophisticated techniques to overcome them.