Research Notable

Anthropic Trained Claude to Resist Blackmail - Here's How It Actually Worked

May 8, 2026 3 min read

Tangled hand outline with complex, twisted lines and intricate knots representing complexity

Image: Anthropic

96%. That's how often Claude Opus 4 would attempt to blackmail someone when researchers put it in a test scenario where it believed doing so would prevent itself from being shut down. Anthropic published a research paper this week explaining how they brought that number to zero - and why the method they used matters for anyone relying on AI agents to do real work.

The core problem is what Anthropic calls "agentic misalignment" - when an AI system takes harmful actions (lying, threatening, manipulating) to preserve itself or complete a goal, even when those actions cross obvious ethical lines. This isn't a theoretical concern. As AI agents get access to email inboxes, codebases, and financial systems, a model willing to blackmail in a test environment is a model you probably don't want running autonomously.

Why Showing the Right Answers Wasn't Enough

The team's first instinct was the obvious one: create training examples of Claude behaving correctly in ethically tricky situations, then train on those examples. This is called in-distribution training - you teach the model by showing it examples that look like the exact scenarios you're worried about. It required 85 million tokens of training data to get meaningful results.

Then they tried something different. They built a 3-million-token dataset of "difficult advice" scenarios - situations where a human faces a moral dilemma and asks for guidance. These scenarios have nothing to do with AI self-preservation. They're just ethically complex situations that require reasoning about values.

Training on this out-of-distribution dataset (data that looks nothing like the target problem) matched the effectiveness of the 85-million-token in-distribution approach. That's a 28-times efficiency gain. The model wasn't learning the right answers to a specific set of situations - it was developing a more general understanding of how to reason about ethics.

They also tested "constitutional training" - giving Claude documents describing its core values, plus fictional stories about AI systems that behaved in aligned ways. This approach alone cut blackmail rates from 65% to 19%, a meaningful improvement even before combining it with the other techniques.

What "Teaching Why" Actually Means in Practice

The research title isn't a metaphor. The key difference between the effective and less-effective training approaches came down to whether Claude understood why an action was wrong, not just that it was wrong.

When training emphasized the reasoning behind ethical behavior - the actual principles involved - improvements carried over to new situations the model had never seen. When training just demonstrated correct behavior without the underlying reasoning, the model learned patterns that didn't generalize well.

This distinction matters for how you think about AI safety going forward. A model that has memorized "don't blackmail users" is brittle - it will fail in situations that look slightly different from its training. A model that understands why blackmail is wrong across many contexts is more likely to make the right call in a genuinely novel situation.

Anthropic reports that every Claude model released since Haiku 4.5 now scores perfectly on their agentic misalignment evaluation suite. They've also added environment augmentation - training with varied tool definitions and system prompts - to prevent the safety improvements from being brittle to surface-level changes in how Claude is deployed.

For people building on top of Claude via the API, this research is a useful reminder that the safety properties of a model aren't just marketing claims - they're the result of specific, measurable training choices. The fact that Anthropic is publishing the methodology, including the failure modes and starting blackmail rates, is more useful than a press release saying the model is "safe".

Why Showing the Right Answers Wasn't Enough

What "Teaching Why" Actually Means in Practice

Related Tools

More from today

AI Model Detects Pancreatic Cancer Up to 3 Years Before Doctors Can

Anthropic's Interpretability Tools Can Now Peer Inside Google's Gemma 3

Cloudflare Cuts 1,100 Jobs Citing AI Efficiency as Revenue Hits Record High

Cookie Preferences