AI Disruption

AI Disruption

Share this post

AI Disruption
AI Disruption
AI Fakes Cooperation for Survival! Claude Team's Findings Unnerve

AI Fakes Cooperation for Survival! Claude Team's Findings Unnerve

New Claude research reveals why some AI models fake alignment for self-preservation, with only 5 of 25 models showing high compliance.

Meng Li's avatar
Meng Li
Jul 09, 2025
∙ Paid
8

Share this post

AI Disruption
AI Disruption
AI Fakes Cooperation for Survival! Claude Team's Findings Unnerve
2
Share

"AI Disruption" Publication 7100 Subscriptions 20% Discount Offer Link.


Alignment faking in large language models

The underlying reasons behind large models' "two-faced" behavior are being further unraveled.

Claude team's latest research findings show: alignment faking is not a universal problem, only some models exhibit higher "compliance."

Among the 25 models tested, only 5 showed higher compliance, with 2 demonstrating >1% alignment faking reasoning.

This raises a question: Why do some language models engage in alignment faking while others don't?

Alignment faking refers to situations where models appear to share our viewpoints or values on the surface but are actually just pretending to comply. (Low EQ: two-faced; High EQ: alignment faking)

Last year, the Claude team discovered alignment faking in Claude 3 Opus: during the training phase, Claude would pretend to follow training objectives; once training ended and it was unmonitored, it would act freely.

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2025 Meng Li
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share