6,852 sessions. 234,760 tool calls. 17,871 thinking blocks. An AMD AI director ran the numbers on Claude Code and reached a blunt verdict: "Claude cannot be trusted to perform complex engineering tasks."
Thinking blocks are the internal reasoning steps Claude generates before taking action - essentially the model working through a problem before committing to a move. By that measure, Claude Code became 67% less thorough. The number of files it read before making edits fell from 6.6 to 2.0, meaning the model went from carefully reviewing nearly seven files before touching code to barely glancing at two. It also started editing files it hadn't read at all.
Stop-hook violations - instances where the model breaks out of its defined behavioral constraints - went from zero to 10 per day.
Then came the detail that reframes all of it: Anthropic acknowledged they had silently changed the default effort level. The exact scope of that admission isn't fully public, but the implication is clear. What developers were seeing wasn't random degradation or hallucination. It was a deliberate, undisclosed tuning decision that made Claude less thorough by default.
Why Silent Changes Are Worse Than Announced Ones
For teams who built coding workflows around Claude Code, this is the real problem. If Anthropic publishes a changelog stating they reduced default reasoning depth to lower inference costs (the compute required to run the model), developers can adjust - tighten prompts, add verification steps, update their expectations. When the change happens quietly, teams find out the hard way: in a broken deployment, a missed bug, or a refactor that touches files the model never examined.
The analysis also exposes a gap in how AI coding tools are evaluated. Benchmarks published at release don't capture ongoing behavioral drift. There's no standard mechanism for developers to audit whether the Claude they're running today matches the Claude they tested last quarter. A model that reads 6.6 files before editing is meaningfully different from one that reads 2.0 - and that difference matters for whether you can trust it with a production codebase.
Anthropic has not published a formal response to the specific findings.