Tools

How to Run Claude Code Against a Local LLM (And What You're Giving Up)

May 23, 2026 2 min read

Image: Anthropic

Running Claudee Code](/tools/claude-code/) without paying Anthropic per token is possible, and a detailed community guide making the rounds this week explains exactly how to do it. The short version: Claude Code supports a custom API base URL, which means you can point it at any OpenAI-compatible endpoint, including a local model running through Ollama on your own machine.

The setup takes roughly 15 minutes. You install Ollama, pull a capable code-focused model (Qwen2.5-Coder, Devstral, or DeepSeek-Coder are the common picks), then set two environment variables before launching Claude Code: ANTHROPIC_BASE_URL pointing to your local Ollama server and ANTHROPIC_API_KEY set to any dummy string since authentication doesn't apply locally. Claude Code reads the base URL first and routes requests there instead of Anthropic's servers.

What Actually Works

For contained tasks - writing a function, explaining a code block, generating a test - local models perform reasonably well, especially the 32B+ parameter variants on machines with a decent GPU. The guide reports Qwen2.5-Coder-32B handling routine coding requests without obvious quality drops compared to Claude 3.5 Sonnet on simpler tasks.

Where it falls apart is multi-file reasoning and anything requiring the model to hold a lot of context in mind at once. Local models tend to lose track of dependencies across files, produce more hallucinated function names, and struggle with the kind of agentic back-and-forth (running a command, reading the output, deciding what to fix) that Claude Code was designed around.

The Real Cost Calculation

The pitch is zero API costs, but that framing is incomplete. A GPU capable of running a 32B model fast enough to feel usable (at least 24GB VRAM) costs $800 to $2,000 for a consumer card. Anthropic's API costs for heavy Claude Code usage typically run $50-150/month for a working developer. The hardware pays off around 12-18 months if you're running it constantly, and only if inference speed is acceptable for your workflow.

For developers on air-gapped machines, teams with strict data residency rules, or anyone experimenting with local AI tooling, the setup is worth knowing. For everyone else, the capability gap between a local 32B model and Claude Sonnet 4 is large enough that you'll notice it in daily use.

What Actually Works

The Real Cost Calculation

Related Tools

More from today

Anthropic's Code with Claude Showed AI Programming Without a Human in the Loop

Claude's Three Products Don't Share Context - Users Want That Fixed

Claude Desktop for Windows Now Warns You Before You Burn Your Context Window

Cookie Preferences