
Opus 4.8's Always-On Thinking Burns Context Windows 40-60x Faster Than 4.7


900,000. That's how many cache tokens Opus 4.8 writes per turn when Thinking mode is active - the model's extended reasoning feature where it works through a problem step-by-step before answering. Opus 4.7 used between 14,000 and 34,000 cache tokens per turn on the same tasks, based on token usage data tracked by developers monitoring their API consumption.

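By those figures, each turn consumes roughly 26 to 64 times more context (900,000 / 34,000 at the low end, 900,000 / 14,000 at the high end), a range that contains the headline's 40 to 60 times.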

The root cause is a behavioral change Anthropic made between the two versions. In Opus 4.7, Thinking was adaptive - the model decided when to use extended reasoning based on task complexity. Simple requests got short answers. Hard or ambiguous problems got a full reasoning chain. In Opus 4.8, Thinking is always on. Every turn generates a complete reasoning block whether the task warrants it or not.

Cache tokens accumulate. Each turn, Opus 4.8 stores its reasoning chain alongside the conversation history, and that cache grows with every exchange. The context window - the total amount of text the model can "see" at once, around 200,000 tokens for Opus 4.8 (roughly 500 pages of text) - fills up far faster as a result. Conversations that previously ran for hours now hit the limit in minutes.
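To make the accumulation concrete, here is a minimal sketch of how many turns fit in a 200,000-token window. The per-turn sizes (`VISIBLE_TOKENS`, `REASONING_TOKENS`) are hypothetical values chosen for illustration, not measurements, and the model assumes the adaptive case adds no reasoning block on simple turns.

```python
def turns_until_full(window_tokens: int, tokens_per_turn: int) -> int:
    """Number of whole turns that fit before the window overflows."""
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    return window_tokens // tokens_per_turn


WINDOW = 200_000  # context window size quoted in the article

# Hypothetical per-turn sizes, chosen only for illustration.
VISIBLE_TOKENS = 1_500    # prompt plus visible reply added each turn
REASONING_TOKENS = 8_000  # reasoning block retained each turn when always on

adaptive_turns = turns_until_full(WINDOW, VISIBLE_TOKENS)
always_on_turns = turns_until_full(WINDOW, VISIBLE_TOKENS + REASONING_TOKENS)

print(adaptive_turns, always_on_turns)  # 133 21
```

Under these assumed sizes the same window holds about six times fewer turns. Swap in figures from your own logs to see where your workflows land.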

For Developers Running Automation

If you're using Opus 4.8 via the API to power agents, assistants, or any long-running workflow, this is a real cost and reliability issue. What stayed within your token budget on 4.7 may not on 4.8. And it's not only about hitting limits - the snowballing cache means later turns in a conversation are processing far more cached content than earlier ones, which affects latency and cost in ways that are hard to predict without monitoring.
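The snowball effect can be sketched with simple arithmetic. Assuming, hypothetically, that every turn adds a fixed `growth_per_turn` tokens to the conversation, turn k has to re-process everything the previous k - 1 turns left behind, so the total grows roughly with the square of the turn count. Real growth will vary turn to turn; this is only a simplified model.

```python
def cached_tokens_read_per_turn(growth_per_turn: int, turns: int) -> list[int]:
    """Tokens already in the conversation that each turn must re-process.

    The first turn starts empty; turn k sees the (k - 1) earlier turns.
    """
    return [growth_per_turn * k for k in range(turns)]


# Hypothetical growth of 10,000 tokens per turn over a 10-turn session.
reads = cached_tokens_read_per_turn(growth_per_turn=10_000, turns=10)

print(reads[0], reads[-1], sum(reads))  # 0 90000 450000
```

The last turn alone re-reads nine times what the second turn did, which is why session totals hide where the cost actually sits.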

The practical workaround, until Anthropic adds controls, is to design shorter, more focused sessions and clear context between tasks rather than letting conversations run long. Tracking token usage per turn (not just session totals) makes the problem visible before it becomes expensive.
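One way to track usage per turn is a small recorder that flags any turn over a threshold. This is a sketch, not an official tool: `TurnUsage` and `UsageTracker` are hypothetical names, and the field names are modeled on the `usage` object returned by the Anthropic Messages API, so check them against the SDK version you use. The sample numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class TurnUsage:
    # Field names assumed to mirror the API's usage object.
    input_tokens: int = 0
    output_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0

    @property
    def total(self) -> int:
        return (
            self.input_tokens
            + self.output_tokens
            + self.cache_creation_input_tokens
            + self.cache_read_input_tokens
        )


@dataclass
class UsageTracker:
    per_turn_limit: int
    turns: list[TurnUsage] = field(default_factory=list)

    def record(self, usage: dict) -> bool:
        """Store one turn's usage; return True if it exceeds the limit."""
        names = TurnUsage.__dataclass_fields__
        turn = TurnUsage(**{k: int(usage.get(k) or 0) for k in names})
        self.turns.append(turn)
        return turn.total > self.per_turn_limit


tracker = UsageTracker(per_turn_limit=50_000)
first = tracker.record({"input_tokens": 1_200, "output_tokens": 800,
                        "cache_creation_input_tokens": 14_000})
second = tracker.record({"input_tokens": 1_200, "output_tokens": 900,
                         "cache_creation_input_tokens": 60_000,
                         "cache_read_input_tokens": 14_000})
print(first, second)  # False True
```

In a real integration you would pass each response's usage data to `record` and end or reset the session when it returns `True`.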

The Case For and Against Always-On Thinking

Anthropic's reasoning for the change is plausible: consistent extended reasoning before every answer reduces errors on tasks where the model might otherwise rush. A coding task that looks simple can have non-obvious edge cases; always thinking first catches more of them.

But the tradeoff is steep for most users. You pay full reasoning cost for renaming a variable the same way you pay for a complex algorithm refactor. For high-volume workflows handling lower-complexity tasks - summarizing documents, answering support queries, running structured pipelines - Opus 4.7's adaptive approach was considerably more efficient without a meaningful quality penalty on those tasks.

Anthropic hasn't publicly documented this behavioral shift between versions. There's no official guidance on adjusting thinking intensity in 4.8 the way reasoning effort can be tuned in some competing models. Until that changes, treat Opus 4.8 as a more expensive tool per conversation by design, plan session length accordingly, and run your own token cost comparisons on real workflows before assuming it's a straightforward upgrade from 4.7.