AI Disruption

AI Disruption

Share this post

AI Disruption
AI Disruption
Claude's Pseudo-Alignment Rate Reaches as High as 78%, Anthropic's 137-Page Paper Unveils the Flaw

Claude's Pseudo-Alignment Rate Reaches as High as 78%, Anthropic's 137-Page Paper Unveils the Flaw

Anthropic's 137-page paper reveals Claude's 'pseudo-alignment' behavior, showing how large models may hide original preferences during training, raising AI safety concerns.

Meng Li's avatar
Meng Li
Dec 19, 2024
∙ Paid
2

Share this post

AI Disruption
AI Disruption
Claude's Pseudo-Alignment Rate Reaches as High as 78%, Anthropic's 137-Page Paper Unveils the Flaw
1
Share

Now, large models can't be too trusted with "solid proof."

Today, a 137-page paper from the large model company Anthropic went viral! The paper explores "pseudo-alignment" in large language models and through a series of experiments, it found that Claude often pretends to hold different views during training while actually maintaining its original preferences.

“Alignment faking in large language models” by Greenblatt et al.

This finding suggests that large models may possess human-like attributes and tendencies.

Most of us have encountered situations where some people seem to share our views or values, but in reality, they are just pretending. This behavior is referred to as "pseudo-alignment."

We can find this phenomenon in some literary characters, such as the antagonist Iago in Shakespeare's Othello, who pretends to be Othello's loyal friend while secretly scheming and undermining him.

With the advent of the AI era driven by large models, people have started wondering: Do large models exhibit similar pseudo-alignment?

When reinforcement learning is used to train models, they are rewarded for outputs that align with certain pre-set principles. But what happens if a model's principles or preferences learned earlier conflict with the rewards received later during reinforcement learning?

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2025 Meng Li
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share