
Claude Code's Attribution Header Slows Local LLMs by 90%

March 10, 2026 1 min read


Running open-source models locally through Claude Code? There's a good chance you're getting a fraction of the speed you should be.

Unsloth's documentation for running local LLMs with Claude Code flags a significant performance problem: Claude Code prepends an attribution header to requests, and that header invalidates the KV cache (the mechanism that stores previously computed results so the model doesn't have to reprocess them from scratch). The result is a roughly 90% slowdown in inference speed, which turns a fast local setup into something barely usable.
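To see why a header at the front of a request is so costly, here is a toy sketch of prefix-based cache reuse. It assumes the server can only reuse cached work for the leading tokens that match the previous request, and that the attribution header sits at the start of the prompt and differs between requests. Neither detail is spelled out in the source, and the token lists are made up for illustration.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count the leading tokens shared by the cached prompt and the new prompt."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared


# Hypothetical token streams: a long, stable system prompt plus a growing chat.
system_prompt = ["sys"] * 1000
turn_1 = ["user: hi"]
turn_2 = turn_1 + ["assistant: hello", "user: fix the bug"]

# No header: the second request starts with everything the first one contained.
stable_reuse = reusable_prefix_len(system_prompt + turn_1, system_prompt + turn_2)

# Assumed per-request header: the very first token differs, so nothing matches.
header_reuse = reusable_prefix_len(
    ["attribution:req-1"] + system_prompt + turn_1,
    ["attribution:req-2"] + system_prompt + turn_2,
)

print(stable_reuse)  # 1001 tokens can be served from cache
print(header_reuse)  # 0 tokens can be served from cache
```

In this toy model, a single differing token at position zero forces the whole prompt to be reprocessed on every turn, which is consistent with the kind of slowdown the guide describes.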

The fix is a single config change, but it's easy to miss. Add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the env section of ~/.claude/settings.json. According to the guide, setting it as a regular environment variable won't work; it has to go in the settings file directly.
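If you'd rather not edit the file by hand, the change can be scripted. This is a minimal sketch, not an official tool: the helper name disable_attribution_header is made up here, and it assumes settings.json is plain JSON with an optional top-level "env" object. It keeps any other settings already in the file.

```python
import json
from pathlib import Path


def disable_attribution_header(settings_path):
    """Set CLAUDE_CODE_ATTRIBUTION_HEADER to "0" in the env section.

    Hypothetical helper: existing keys in the file are preserved, and the
    file (plus its parent directory) is created if it does not exist yet.
    """
    path = Path(settings_path).expanduser()
    settings = json.loads(path.read_text()) if path.exists() else {}
    # The value is the string "0", matching the setting quoted in the article.
    settings.setdefault("env", {})["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Calling disable_attribution_header("~/.claude/settings.json") applies the change to the default location. Back up the file first if it holds settings you care about.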

Unsloth's guide recommends pairing Claude Code with either Qwen3.5-35B-A3B or GLM-4.7-Flash, which it calls the strongest 35B Mixture of Experts models available right now. Both fit within 24GB of VRAM when using UD-Q4_K_XL quantization (a compression format that shrinks the model while keeping accuracy close to the full version). The model is served by llama.cpp on port 8001, with Q8_0 KV cache quantization to further reduce memory usage.
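For reference, the server side of that setup might be launched along these lines. This is a sketch under assumptions, not the guide's exact command: the model path is a placeholder, the helper name is made up, and only the port and the Q8_0 cache type come from the article. The flags are llama.cpp's llama-server options for the model file, the listening port, and the K and V cache types.

```python
import shlex


def llama_server_command(model_path, port=8001, kv_cache_type="q8_0"):
    """Build a llama-server command line that uses a quantized KV cache."""
    return [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
        "--cache-type-k", kv_cache_type,  # quantize cached keys
        "--cache-type-v", kv_cache_type,  # quantize cached values
    ]


# Placeholder path: substitute the GGUF file you actually downloaded.
cmd = llama_server_command("models/model-UD-Q4_K_XL.gguf")
print(shlex.join(cmd))
```

Depending on the llama.cpp build, a quantized V cache may also require flash attention to be enabled, so check the server's help output before relying on this as written.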

This is the kind of issue that quietly wastes hours. If you've been running local models through Claude Code and wondering why everything feels sluggish compared to direct llama.cpp inference, this is probably the reason. One line in a config file is the difference between a viable local coding assistant and an unusable one.
