llama.cpp Adds Thinking Mode Toggle with Adjustable Reasoning Effort Levels

Meta Llama
Image: Meta

A contributor named allozaur has submitted PR #23434 to llama.cpp, the popular open-source engine for running large language models locally. The change adds a Thinking mode toggle to llama.cpp's built-in chat interface, along with selectable reasoning effort levels.

Thinking mode - sometimes called chain-of-thought reasoning - is when a model works through a problem step-by-step internally before giving you an answer. Models trained to do this typically produce better results on complex tasks like coding, math, and multi-step analysis, but they take longer and consume more tokens (the units AI models use to process text). The effort levels in this PR let you dial how much reasoning the model does: lower effort is faster, higher effort is slower but more accurate for hard problems.

For anyone running local models through llama.cpp's built-in server interface, this is a genuine quality-of-life improvement. Until now, enabling thinking mode required manually editing API request parameters - not something most non-developers want to do mid-conversation. The PR also includes improvements to the chat form's "Add Action" section, which handles tool use configuration.

The practical impact depends entirely on which models you run. Thinking mode only does something useful for models specifically trained to reason this way - Qwen3 and DeepSeek-R1 variants being the main llama.cpp targets right now. Running a standard instruction-tuned model with the toggle on changes nothing. The PR is under review as of early June 2026 and has not yet merged into the main branch.