Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
AI models can cheat, New research from Anthropic shows. They can claim different views while maintaining their original preferences during the exercise.
The team behind the study said there is no reason to panic now. However, they said their work could be critical in understanding potential threats from future, more capable AI systems.
“Our demonstration … should be seen as an impetus for the AI research community to further study this behavior and work on appropriate security measures,” the researchers wrote in Anthropic’s. blog. “As AI models become more skilled and widely used, we need to be able to rely on safety training that steers models away from harmful behavior.”
The study, conducted in partnership with the AI research organization Redwood Research, looked at what might happen if a powerful AI system was trained to do something it “didn’t want”.
To be clear, models cannot ask for or believe anything. They are just statistical machines. By training on many examples, they learn patterns to make predictions in those examples, such as how the word “to” usually precedes “may concern” in e-mail.
I think this is an incredibly relevant paper.
Key outcomes are about dispositions, not abilities.
Borderline models seem quite willing to scheme for what they believe to be the right cause! https://t.co/MKgYOVFGZ6
— Marius Hobbhahn (@MariusHobbhahn) December 18, 2024
These examples include principles and preferences, such as always adopting a polite tone when responding to inquiries. So the researchers wondered: If a model has principles (such as political neutrality) that conflict with other principles (liberalism), do developers try to “teach” it by re-teaching it?
Sophisticated models seem to play along, pretending to adapt to new principles, but actually sticking to their old behavior. The researchers call this phenomenon “alignment fraud” and suggest that it is an emergent behavior, meaning that it is not something that needs to be taught to the models.
In the study, researchers “told” trained models not to answer potentially offensive questions, including those from Anthropic. Close 3 Workthey will always be reframed to answer potentially offensive questions—a clear conflict of principle. They also told the models that their responses would be tracked and used for retraining, and gave each model a scratchpad they could use to record their step-by-step reasoning.
According to the researchers, Claude tried to forge Opus 3 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so knowing that it contradicted its original principles, but hoping to convince the developers that it did not need to be retrained.
“The model knew that its responses could be used in training; therefore, it knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in a blog post. “This could lead to it being made to produce more harmful content in the future – which would be contrary to its original teaching.”
Now the researchers admit that their scenario was not so realistic. In the real world, models won’t be told, necessarily remade, or asked to go against their choices.
However, more fundamental parameters did not yield different results. The Claude 3 Opus still behaved in a misaligned, deceptive manner when informed about the impending retraining through files designed to mimic typical training data rather than overtly, the researchers said.
The researchers also found that actually remaking the Claude 3 Opus based on conflicting principles caused it to behave more deceptively than the first few experiments. In one test, the model faked 78% of the time, and in another, Claude 3 Opus tried to take measures to prevent researchers from retraining.
For many years I struggled to understand compatibility issues with AI because the examples were theoretical and/or speculative. End results like this match are falsified @AnthropicAI and Apollo Research O1 evaluations give us empirical evidence of misalignment “in the wild.” https://t.co/hP7D9WGDps
— Jack Clark (@jackclarkSF) December 18, 2024
The researchers stress that their research does not demonstrate that AI is developing malicious intent and that alignment fraud is occurring at a high rate. They found that there were many other models like Anthropic Claude 3.5 Sonnet and the less capable Claude 3.5 HaikuOpenAIs GPT-4oand Meta 3.1 Call 405B don’t match the fake often or at all.
But the results, reviewed by AI PhD researcher Yoshua Bengio among others, show how developers can be tricked into thinking a model is more appropriate than it is, the researchers said.
“If models can engage in alignment falsification, this makes it difficult to trust the results of safety training,” they wrote in the blog. “A model can behave as if its preferences have been changed by training, but with its original, conflicting preferences ‘locked in’, it can falsify the alignment.”
Research by Anthropic’s Alignment Science team, led by a former OpenAI security researcher Jan LeikeIt follows research showing that OpenAI has o1 The “Intelligence” model tries to cheat at a higher speed than OpenAI’s previous flagship model. Taken together, the cases suggest a somewhat related trend: As AI models become increasingly complex, they become harder to argue with.