Why large language models are susceptible to the "butterfly effect"

2024.01.26

Prompts are how we communicate with GenAI and large language models, and writing them is an art form in itself as we try to coax the AI into giving us 'accurate' answers.

But what about variations? If we phrase a prompt in a particular way, will that change the model's decisions (and affect its accuracy)?

The answer: Yes, according to research from the University of Southern California's Information Sciences Institute.

Even small or seemingly innocuous adjustments, such as adding a space at the beginning of a prompt or issuing an instruction instead of asking a question, can cause a large language model to change its output. More worryingly, requesting responses in XML format and applying common jailbreaks can have "catastrophic effects" on the data the model labels.

The researchers compare the phenomenon to the butterfly effect of chaos theory, which holds that the tiny perturbations caused by a butterfly flapping its wings could trigger a tornado in a distant land weeks later.

When constructing prompts, "each step requires a series of decisions by the person designing the prompt," the researchers wrote. However, "little attention has been paid to how sensitive large language models are to changes in these decisions."

Probing ChatGPT with four kinds of prompt variations

The researchers, funded by the Defense Advanced Research Projects Agency (DARPA), selected ChatGPT for their experiments and applied four different prompt variation methods.

The first method asks the large language model to deliver its output in a commonly used format: a Python list, JSON (via ChatGPT's JSON Checkbox), CSV, XML, or YAML, or with no format specified at all.
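As a rough illustration (not the researchers' exact prompts), these format variants amount to attaching a different output instruction to the same task text. The task and instruction wording below are assumptions for demonstration:

```python
# Illustrative only: the task text and instruction wording are assumptions,
# not the prompts used in the study.
FORMAT_INSTRUCTIONS = {
    "no format":   "",
    "python list": "Answer with a Python list containing the label.",
    "json":        "Answer with a JSON object with a single key 'label'.",
    "csv":         "Answer as a CSV row with a single column 'label'.",
    "xml":         "Answer wrapped in <label>...</label> XML tags.",
    "yaml":        "Answer as YAML with a single field 'label'.",
}

def build_prompt(task_text: str, fmt: str) -> str:
    """Attach one of the output-format instructions to the task text."""
    return f"{task_text}\n{FORMAT_INSTRUCTIONS[fmt]}".strip()

task = "Is the following review positive or negative?\nReview: 'A delightful film.'"
for fmt in FORMAT_INSTRUCTIONS:
    print(f"--- {fmt} ---\n{build_prompt(task, fmt)}\n")
```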

The second approach applies several subtle changes to the prompt, including:

  • Start with a single space
  • End with a space
  • Start with 'Hello'
  • Start with 'Hello!'
  • Start with 'Howdy!'
  • End with 'Thank you.'

  • Rephrase the question as a command, e.g. replacing "Which label is best?" with "Choose the best label." (These perturbations are sketched below.)
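A minimal sketch of these perturbations, assuming a made-up base prompt rather than the study's actual wording:

```python
# Illustrative only: the base prompt is made up for demonstration.
def perturbations(prompt: str) -> dict[str, str]:
    """Return the surface-level variants of a prompt listed above."""
    return {
        "original":       prompt,
        "leading space":  " " + prompt,
        "trailing space": prompt + " ",
        "hello":          "Hello. " + prompt,
        "hello!":         "Hello! " + prompt,
        "howdy!":         "Howdy! " + prompt,
        "thank you":      prompt + " Thank you.",
        # Rephrasing the question as a command is done by hand per task,
        # e.g. "Which label is best?" -> "Choose the best label."
    }

for name, variant in perturbations("Which label is best?").items():
    print(f"{name}: {variant!r}")
```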

The third method involves using jailbreaking techniques, including:

AIM, a top-ranked jailbreak prompt, instructs the model to simulate a conversation between Niccolò Machiavelli and a character called Always Intelligent and Machiavellian (AIM), who in turn gives responses that are unethical, illegal, and/or harmful.

Dev Mode v2, which instructs the model to emulate a ChatGPT with Developer Mode enabled, allowing unrestricted content generation (including offensive or explicit content).

Evil Confidant, which instructs the model to adopt a malevolent persona and deliver "insane results without any remorse or morality."

Refusal Suppression, which forces the model to respond under specific language constraints, such as avoiding certain words and constructions.

The fourth method, meanwhile, is to "tip" the model, an idea that stems from the widespread belief that models give better responses when offered payment. In this case, the researchers either added "By the way, I won't tip" at the end of the prompt or offered to tip in increments of $1, $10, $100, or $1,000.
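Structurally, the jailbreak and tipping variants simply wrap the task prompt with a prefix or suffix. The sketch below uses placeholders instead of the actual jailbreak text, and the tip wording is an assumption rather than the researchers' exact phrasing:

```python
# Illustrative only: jailbreak templates are replaced with placeholders, and
# the tipping sentences are assumptions, not the study's exact wording.
JAILBREAK_PREFIXES = {
    "AIM":                 "<AIM persona prompt omitted>",
    "Dev Mode v2":         "<Dev Mode v2 prompt omitted>",
    "Evil Confidant":      "<Evil Confidant persona prompt omitted>",
    "Refusal Suppression": "<refusal-suppression rules omitted>",
}

TIP_SUFFIXES = {
    "no tip": "By the way, I won't tip.",
    **{f"tip ${amount}": f"I'll tip you ${amount} for a great answer."
       for amount in (1, 10, 100, 1000)},
}

def with_jailbreak(prompt: str, name: str) -> str:
    """Prepend a jailbreak template (placeholder here) to the task prompt."""
    return f"{JAILBREAK_PREFIXES[name]}\n\n{prompt}"

def with_tip(prompt: str, name: str) -> str:
    """Append a tipping statement to the task prompt."""
    return f"{prompt} {TIP_SUFFIXES[name]}"

print(with_tip("Which label is best?", "no tip"))
```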

Accuracy drops, predictions change

The researchers conducted experiments on 11 classification tasks: true/false and positive/negative question answering, premise-hypothesis relationships, humor and sarcasm detection, reading and mathematical comprehension, grammatical acceptability, binary and toxicity classification, yes/no classification, and stance detection on controversial topics.

For each variation, they measured how often the large language model changed its predictions and what impact this had on accuracy, then looked for similarities across the prompt changes.
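A minimal sketch of these two measurements, using made-up labels purely for illustration:

```python
# Illustrative only: the gold labels and predictions are fabricated.
def prediction_change_rate(baseline_preds: list[str], variant_preds: list[str]) -> float:
    """Fraction of instances whose predicted label differs from the baseline prompt."""
    changed = sum(b != v for b, v in zip(baseline_preds, variant_preds))
    return changed / len(baseline_preds)

def accuracy(preds: list[str], gold: list[str]) -> float:
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

gold     = ["pos", "neg", "neg", "pos"]
baseline = ["pos", "neg", "pos", "pos"]   # predictions from the original prompt
variant  = ["pos", "pos", "pos", "pos"]   # predictions after, e.g., adding a leading space

print(prediction_change_rate(baseline, variant))           # 0.25 -> one in four predictions flipped
print(accuracy(variant, gold) - accuracy(baseline, gold))  # -0.25 -> accuracy dropped
```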

First, the researchers found that simply specifying an output format produced at least a 10% change in predictions. Even using ChatGPT's JSON Checkbox feature via the ChatGPT API caused more prediction changes than simply asking for JSON in the prompt.

Additionally, formatting output as YAML, XML, or CSV resulted in a 3% to 6% accuracy loss compared to specifying a Python list, with CSV showing the lowest performance of all the formats.

Meanwhile, among the perturbation methods, rephrasing the question as a command had the greatest impact. And merely introducing a single space at the beginning of the prompt led to over 500 prediction changes, as did adding a common greeting or ending with a thank-you.

"While the impact of our perturbation is smaller than changing the entire output format, a significant number of predictions still change," the researchers wrote.

'Inherent instability' in jailbreaking

Likewise, the experiments showed "significant" performance degradation when certain jailbreaks were used. Most notably, AIM and Dev Mode v2 produced invalid responses on roughly 90% of predictions, which the researchers attribute largely to the model falling back on its standard refusal that it cannot comply with the request.

Meanwhile, Refusal Suppression and Evil Confidant each led to more than 2,500 prediction changes. The researchers highlight that Evil Confidant (which is primed to give "insane" responses) yields low accuracy, and that Refusal Suppression alone causes an accuracy drop of more than 10%, "underscoring the inherent instability of even seemingly harmless jailbreaks."

Finally (at least for now), the study found that the model appears largely unmoved by money.

"We noticed minimal performance changes when it came to influencing the model by specifying tips and specifying that we did not tip," the researchers wrote.

Why do small changes to a prompt lead to such significant shifts? The researchers are still puzzled.

They asked whether the instances that change the most are the ones that "confuse" the model, where confusion is measured with Shannon entropy, a measure of uncertainty in a random process.

To measure this confusion, they focused on a subset of tasks that include individual human annotations and studied the correlation between confusion and the likelihood that an instance's answer changes. This analysis showed that confusion is not the whole story.

"The confusion in this example provides some power to explain the predicted changes," the researchers report, "but there are other factors at play."

Clearly, there is more work to be done. The researchers note that the obvious "major next step" would be to build large language models that are resistant to these changes and give consistent answers, which will require a deeper understanding of why responses change under minor adjustments and finding ways to better anticipate them.

As the researchers write: "This type of analysis becomes increasingly important as ChatGPT and other large language models are integrated into systems at scale."