
Claude Opus 4.6 Shows Consistent Reasoning Failures in User Tests

April 8, 2026


Something changed in Claude Opus 4.6. Users running a specific reasoning scenario, an informal benchmark called the "car wash test" that had become a reliable way to gauge the model's logical depth, report that the model now fails it consistently: five out of five attempts. A streak like that is hard to attribute to random variation alone.
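To put a rough number on the five-out-of-five claim: if each attempt were independent and the model still passed with some probability p, the chance of five straight failures would be (1 - p)^5. The sketch below works that out for a few pass rates. Those rates are illustrative assumptions, since the reports give no measured baseline, and `prob_all_fail` is a name made up for this example.

```python
def prob_all_fail(pass_rate: float, attempts: int = 5) -> float:
    """Probability of failing every attempt, assuming attempts are independent."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return (1.0 - pass_rate) ** attempts


# Hypothetical prior pass rates; the source reports no measured baseline.
for p in (0.5, 0.8, 0.9):
    print(f"prior pass rate {p:.0%}: P(5 of 5 fail) = {prob_all_fail(p):.5f}")
```

Even at a coin-flip pass rate, five consecutive failures would happen only about 3% of the time, and the figure shrinks quickly as the assumed prior pass rate rises. Five attempts is still a small sample, so this supports suspicion of a change, not proof of one.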

The car wash test is a multi-step logic problem. Claude Opus 4.6 uses extended thinking, a mode in which the model works through a problem step by step internally before giving its final answer, much like showing your work on an exam. The test was particularly good at exposing whether that internal reasoning was running at full depth. Based on recent reports, it no longer is.

Two explanations cover most cases when a hosted AI model starts failing tests it previously passed. The first: the company pushed a quiet update to the model's weights (the numerical parameters that define how the model behaves), and something degraded. The second: Anthropic deliberately reduced the reasoning effort allocated to certain problem types to lower computational costs or speed up responses. Either explanation matters directly for users paying for Opus 4.6 because of its reasoning capabilities.

Anthropic hasn't commented on the issue. The company updates its hosted models without always announcing changes or bumping version numbers, meaning the model running under the "claude-opus-4-6" label today may not be identical to what ran under that label last week. This is standard practice across AI providers, but it creates real problems for anyone who depends on consistent, reproducible behavior from the API.
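For teams in that position, one option is to rerun a fixed prompt on a schedule and log the pass count next to the date, so a change under the same label shows up as a shift in the record. The following is a minimal sketch under stated assumptions: `ask_model` and `is_correct` are hypothetical callables the reader would supply (the first wrapping whatever API client they use, the second judging the reply), neither belongs to any Anthropic SDK, and the prompt is a placeholder because the car wash test's wording is not given here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass
class RunResult:
    model_label: str
    run_date: str
    attempts: int
    passes: int

    @property
    def pass_rate(self) -> float:
        return self.passes / self.attempts


def run_check(
    ask_model: Callable[[str], str],    # hypothetical: sends a prompt, returns reply text
    is_correct: Callable[[str], bool],  # hypothetical: decides whether a reply passes
    prompt: str,
    model_label: str,
    attempts: int = 5,
) -> RunResult:
    """Send the same prompt several times and count how many replies pass."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    passes = sum(1 for _ in range(attempts) if is_correct(ask_model(prompt)))
    return RunResult(model_label, date.today().isoformat(), attempts, passes)


if __name__ == "__main__":
    # Offline stand-ins so the sketch runs as-is; swap in a real client call.
    fake_model = lambda prompt: "placeholder answer"
    fake_checker = lambda reply: reply == "expected answer"
    result = run_check(fake_model, fake_checker, "<your test prompt>", "claude-opus-4-6")
    print(result, f"pass rate = {result.pass_rate:.0%}")
```

Storing each `RunResult` over time gives a baseline to compare against, which is the piece missing from most anecdotal reports of a regression.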
