Claude Mythos Posts METR Score That Breaks the Chart Scale

Anthropic's Claude Mythos model posted a score on METR's autonomous task benchmark that the chart literally couldn't fit. The result exceeded the top of the existing scale, forcing METR to extend it.

For context: METR (Model Evaluation and Threat Research) runs a type of evaluation that most AI benchmarks skip. Instead of asking a model to answer questions or solve puzzles, METR gives AI agents real-world software tasks and rates each task by how long a human expert would need to complete the equivalent work. A score of "30 minutes" means the AI can complete tasks that would take a skilled human 30 minutes of focused effort, working largely on its own without step-by-step guidance. METR's headline figure is the task length at which the model succeeds about half the time, so the score is a reliability threshold rather than a guarantee. The chart tracking this metric across frontier models over time has been nicknamed "the most important chart in AI" because it measures something practical: actual autonomous capability, not pattern-matching on test sets.
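To make the "30 minutes" idea concrete, here is a minimal sketch of how a time-horizon figure can be estimated from pass/fail results. It assumes a logistic curve of success probability against the log of human task time, and reads off the task length where that curve crosses 50%. The function name, the fitting settings, and the data are all hypothetical; the data is synthetic and does not represent any real model's results.

```python
import math

def fit_time_horizon(results, steps=5000, lr=0.1):
    """Estimate the task length (in human minutes) at which success is 50%.

    results: list of (human_minutes, success) pairs, success being 1 or 0.
    Fits P(success) = sigmoid(alpha + beta * (log2(minutes) - mean))
    by plain gradient descent on the log loss.
    """
    xs = [math.log2(minutes) for minutes, _ in results]
    mean_x = sum(xs) / len(xs)  # centering keeps the fit well conditioned
    alpha, beta = 0.0, 0.0
    n = len(results)
    for _ in range(steps):
        grad_alpha = grad_beta = 0.0
        for x, (_, success) in zip(xs, results):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * (x - mean_x))))
            err = p - success  # derivative of log loss w.r.t. the logit
            grad_alpha += err
            grad_beta += err * (x - mean_x)
        alpha -= lr * grad_alpha / n
        beta -= lr * grad_beta / n
    # The curve crosses 50% where alpha + beta * (x - mean_x) == 0.
    return 2 ** (mean_x - alpha / beta)

# Synthetic, illustrative results: short tasks mostly pass, long ones mostly fail.
RESULTS = [
    (1, 1), (1, 1), (2, 1), (2, 1), (4, 1), (4, 1),
    (8, 1), (8, 0), (16, 1), (16, 0), (32, 0), (32, 1),
    (64, 0), (64, 0), (128, 0), (128, 0), (256, 0), (256, 0),
]

if __name__ == "__main__":
    # The synthetic data is symmetric around 16 minutes, so the fit lands near 16.
    print(f"Estimated 50% time horizon: {fit_time_horizon(RESULTS):.1f} minutes")
```

The point of the sketch is that the score is a crossover point on a curve: a model with a 16-minute horizon still fails some shorter tasks and passes some longer ones.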

What the Trajectory Has Looked Like

A year ago, top models scored somewhere in the range of minutes on METR's hardest task sets. The trend has been steep - roughly doubling every few months as models improved at planning, tool use, and recovering from errors mid-task. But even with that trajectory, no model had run past the top of the chart's scale before Claude Mythos.
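The arithmetic behind a doubling trend is simple to sketch. The example below assumes a fixed doubling period and a starting horizon, both of which are placeholder values chosen for illustration; the article only says "every few months", and the function names are hypothetical.

```python
import math

def projected_horizon(start_minutes, months_elapsed, doubling_months):
    """Horizon after months_elapsed, assuming a constant doubling period."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)

def months_to_reach(target_minutes, start_minutes, doubling_months):
    """Months needed to grow from start to target under the same assumption."""
    return doubling_months * math.log2(target_minutes / start_minutes)

if __name__ == "__main__":
    # Placeholder inputs: a 30-minute horizon and a 4-month doubling period.
    print(projected_horizon(30, 12, 4))   # three doublings in a year: 240.0
    print(months_to_reach(480, 30, 4))    # four doublings: 16.0
```

Under exponential growth, any fixed chart ceiling is reached in a number of months that grows only with the logarithm of the ceiling, which is why a steep trend eventually outruns a fixed scale.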

The benchmark matters most for people building AI agents - automated workflows where the AI works through a multi-step problem without a human checking each step. Better METR scores should translate to agents that can handle longer, messier, more realistic tasks without getting stuck or drifting off course. For a developer using Claude Code to refactor a codebase, or a marketer using an AI agent to research and draft a campaign, a higher METR score should mean fewer interruptions to babysit the agent.

What This Means for Agent Reliability

The jump isn't just a bragging-rights moment. METR's evaluation specifically tests the kind of work that breaks current AI agents most often: tasks requiring multiple tool calls, error correction, and judgment calls about when to ask for clarification versus when to proceed. A model that scores significantly higher on these tasks is, in practice, one that completes real jobs more often without hitting a wall and waiting for human rescue.

The practical question is whether Claude Mythos's METR performance translates to the workflows actual users run, which are messier and more varied than any benchmark. Anthropic hasn't published a full technical breakdown yet, so independent testing will matter. But METR is one of the more credible evaluation teams in the field, and its methodology is explicitly designed to resist the kind of benchmark gaming that inflates scores on simpler tests.

For anyone currently running AI agents in production, Claude Mythos deserves a serious look. A chart-breaking METR result doesn't guarantee it handles your specific workflow better, but it's the strongest signal yet that autonomous AI task completion has moved into genuinely new territory.