
Llama.cpp Adds Reasoning Budget Controls for Local AI Models

March 11, 2026


Llama.cpp, the open-source inference engine that lets you run AI models on your own hardware, just added reasoning budget controls - a feature that lets you limit how much "thinking" a model does before producing an answer.

This matters because modern reasoning models (like DeepSeek-R1, QwQ, and other chain-of-thought models) can spend a long time working through problems internally before responding. That's useful for hard math or complex coding problems, but wasteful when you're asking a simple factual question. Until now, running these models locally meant accepting whatever thinking time the model decided on.

The new feature adds a --reasoning-budget parameter that caps the number of tokens (chunks of text) a model can spend on its internal reasoning. Set it low for quick queries, high for complex tasks, or leave it unlimited for problems where you want the model to think as long as it needs to.

How It Works in Practice

The implementation hooks into llama.cpp's sampling pipeline to monitor and limit reasoning token generation. When the budget is hit, the model is forced to move from thinking to producing its final answer.
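The idea can be illustrated with a toy generation loop. This is a minimal sketch, not llama.cpp's actual code: it assumes the model marks its reasoning with `<think>` and `</think>` tags (as DeepSeek-R1 does), and `generate`, `toy_model`, and the `<eos>` marker are hypothetical names invented for the illustration. A negative budget is treated here as "unlimited", which is also an assumption of the sketch.

```python
from typing import Callable, List

# Assumed reasoning delimiters; real models may use different tags.
THINK_OPEN, THINK_CLOSE, END = "<think>", "</think>", "<eos>"


def generate(next_token: Callable[[List[str]], str],
             budget: int,
             max_tokens: int = 64) -> List[str]:
    """Generate tokens, forcing the end of reasoning once `budget` is spent.

    A negative budget means unlimited reasoning in this sketch.
    """
    out: List[str] = []
    in_reasoning = False
    used = 0  # reasoning tokens emitted so far
    while len(out) < max_tokens:
        if in_reasoning and 0 <= budget <= used:
            # Budget exhausted: emit the closing tag instead of sampling.
            tok = THINK_CLOSE
        else:
            tok = next_token(out)
        if tok == END:
            break
        out.append(tok)
        if tok == THINK_OPEN:
            in_reasoning = True
        elif tok == THINK_CLOSE:
            in_reasoning = False
        elif in_reasoning:
            used += 1
    return out


def toy_model(context: List[str]) -> str:
    """Hypothetical stand-in for a model: thinks for 5 steps, then answers."""
    if not context:
        return THINK_OPEN
    if THINK_CLOSE not in context:
        steps = len(context) - 1  # tokens produced after <think>
        return f"step{steps + 1}" if steps < 5 else THINK_CLOSE
    if context[-1] == THINK_CLOSE:
        return "answer"
    return END


print(generate(toy_model, budget=2))
print(generate(toy_model, budget=-1))
```

With a budget of 2, the toy model is cut off after two reasoning steps and moves straight to its answer; with a negative budget it runs through all five steps on its own.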

For anyone running reasoning models locally, this is a practical quality-of-life improvement. A model that would normally spend 30 seconds "thinking" about what day of the week it is can now be told to skip the deep reasoning and just answer. On limited hardware, especially older GPUs or CPU-only setups, those time savings add up fast across dozens of queries.

The feature works with any GGUF-format reasoning model that uses thinking tokens, which covers most of the popular local reasoning models currently available.

Llama.cpp remains the backbone of the local AI movement, and features like this keep narrowing the usability gap between running models on your own machine and using cloud APIs. Budget controls are something the major API providers (OpenAI, Anthropic, Google) have been offering on their hosted reasoning models, so bringing the same capability to local inference is a logical and welcome addition.
