The effort levels in Claude Opus 4.8's extended thinking mode do not mean what they used to. According to Anthropic's system card, the "Low" effort setting on Opus 4.8 now scores higher on SWE-Bench Pro than the "Max" setting did on previous Claude models. Anyone who set effort levels programmatically and left them unchanged may therefore be overspending significantly.
What the Effort Scale Controls
Extended thinking mode lets you tell Claude how hard to reason before it answers. The effort setting controls how many tokens (the units of text the model processes) it spends working through the problem internally. More tokens generally mean more careful reasoning and better accuracy, at the price of slower responses and higher API costs. The scale runs Low, Medium, High, Max, and now a fifth "highest" tier added with Opus 4.8.
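To make the cost side of that trade-off concrete, here is a minimal sketch of the arithmetic. The token budgets and the price used with it are made-up placeholders, not published figures for any Claude model, and the sketch assumes reasoning tokens are billed at the same rate as answer tokens.

```python
# Illustrative only: the reasoning budgets below are hypothetical placeholders,
# not published figures for any Claude model.
ASSUMED_THINKING_TOKENS = {
    "low": 1_000,
    "medium": 4_000,
    "high": 16_000,
    "max": 64_000,
}


def estimate_request_cost(thinking_tokens: int, answer_tokens: int,
                          usd_per_million_tokens: float) -> float:
    """Estimate the output-side cost of one request in USD.

    Assumption: reasoning tokens are billed at the same rate as answer tokens.
    """
    total_tokens = thinking_tokens + answer_tokens
    return total_tokens * usd_per_million_tokens / 1_000_000


def cost_by_effort(answer_tokens: int, usd_per_million_tokens: float) -> dict[str, float]:
    """Return the estimated per-request cost for each assumed effort budget."""
    return {
        effort: estimate_request_cost(budget, answer_tokens, usd_per_million_tokens)
        for effort, budget in ASSUMED_THINKING_TOKENS.items()
    }
```

Substitute your own measured token counts and your actual price per million tokens; the point is only that cost scales with the reasoning budget, so a lower effort level that meets your accuracy bar is directly cheaper.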
SWE-Bench Pro is the benchmark Anthropic cited in its system card. It tests AI systems on real software bug reports from GitHub, which is close to actual day-to-day engineering work rather than toy problems.
The Numbers Say the Scale Shifted
Opus 4.8's Low effort setting achieves scores on SWE-Bench Pro that previously required High or Max on earlier Claude models. In absolute terms, the entire scale moved upward. "Low" on Opus 4.8 is not a degraded mode; it is a recalibrated baseline that happens to be more capable than what used to count as near-maximum effort.
The new "highest" tier above Max exists for edge cases where every additional reasoning step improves accuracy and cost is secondary: auditing safety-critical code, solving novel mathematical problems, or generating structured data where a single field error breaks a downstream system.
A Practical Tier Guide
Given the recalibration, here is a reasonable starting point before you run cost estimates:
- Low: Document summarization, structured data extraction, routine code completion
- Medium: Multi-step problem solving, code review, technical analysis
- High/Max: Complex debugging, architecture decisions, problems where errors have real consequences
- Highest (new): Edge cases where accuracy is paramount and response time does not matter
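The guide above can be captured as a small lookup table so that defaults live in one place. This is a sketch only: the `Effort` names, the task-kind keys, and the fallback to Medium are all hypothetical choices for illustration, not parameter values from Anthropic's API. Where the guide says High/Max, the sketch starts at High.

```python
from enum import Enum


class Effort(Enum):
    # Hypothetical labels for this sketch, not official API parameter values.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    MAX = "max"
    HIGHEST = "highest"


# Starting points taken from the tier guide; task-kind keys are made up.
STARTING_EFFORT = {
    "summarization": Effort.LOW,
    "data_extraction": Effort.LOW,
    "code_completion": Effort.LOW,
    "multi_step_problem": Effort.MEDIUM,
    "code_review": Effort.MEDIUM,
    "technical_analysis": Effort.MEDIUM,
    "complex_debugging": Effort.HIGH,
    "architecture_decision": Effort.HIGH,
    "accuracy_critical": Effort.HIGHEST,
}


def starting_effort(task_kind: str) -> Effort:
    """Return a starting effort level for a task kind.

    Unknown task kinds fall back to MEDIUM, an arbitrary choice for this sketch.
    """
    return STARTING_EFFORT.get(task_kind, Effort.MEDIUM)
```

Treat the table as a first guess to be adjusted once you have measured accuracy and cost on your own workload.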
If you have existing API integrations that default to Max, test them at Medium first. The performance may be equivalent, at a fraction of the cost.
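One way to run that test is to score each effort level on your own evaluation set and keep the cheapest level that stays within a tolerance of your current default. The sketch below assumes you supply an `evaluate` callable, a hypothetical function that runs your test set at a given effort level and returns a `(score, cost)` pair; nothing here calls a real API.

```python
from typing import Callable, Sequence


def cheapest_adequate_effort(
    levels: Sequence[str],
    evaluate: Callable[[str], tuple[float, float]],
    baseline: str,
    tolerance: float = 0.01,
) -> str:
    """Return the cheapest level whose score is within `tolerance` of the baseline.

    `evaluate(level)` is a user-supplied (hypothetical) function returning
    (score, cost) for one effort level, e.g. pass rate and USD spent.
    """
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} must be one of {list(levels)}")

    # Evaluate every level once and remember its (score, cost) pair.
    results = {level: evaluate(level) for level in levels}
    baseline_score, _ = results[baseline]

    # Keep levels that score no more than `tolerance` below the baseline.
    adequate = [
        level for level, (score, _) in results.items()
        if score >= baseline_score - tolerance
    ]
    # The baseline itself is always adequate, so `adequate` is never empty.
    return min(adequate, key=lambda level: results[level][1])
```

A usage example with invented numbers: if `evaluate` returned `(0.80, 1.0)` for "low", `(0.895, 3.0)` for "medium", and `(0.90, 20.0)` for "max", calling the function with `baseline="max"` and the default tolerance would select "medium".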