
Llama.cpp Update Boosts Qwen3.5 and Qwen-Next Token Generation by Up to 60%

What Happened

A recent pull request (PR #19375) merged into llama.cpp delivers a substantial token generation speedup for Qwen3.5 and Qwen3-Coder-Next models. The PR reworks the compute graph to eliminate unnecessary tensor copies and improves how backend kernels handle these specific architectures.

The reported numbers are concrete and span multiple hardware configurations:

  • Dual GPU (RTX 6000 Ada + RTX PRO 6000 Blackwell): Token generation jumped from 87.35 t/s to 118.63 t/s, a 1.36x improvement. Prompt processing went from 2,470 t/s to 2,770 t/s.
  • Single RTX PRO 6000 Blackwell: Token generation went from roughly 80 t/s to 132 t/s, an increase of roughly 60% to 65%.
  • NVIDIA DGX Spark (MXFP4 MoE variant): 34.88 t/s to 45.93 t/s, a 1.32x speedup.
  • Apple M2 Ultra: Token generation improved from 33.75 t/s to 43.78 t/s (1.30x), with prompt processing jumping from 1,047 t/s to 1,338 t/s (1.28x).

All benchmarks used Qwen3-Coder-Next 80B at various quantization levels (Q4_0, Q8_0, MXFP4). The fix also enables CUDA graphs for Qwen3-Next-style architectures, introduces adaptive CPU-GPU interleaving on Metal, and benefits Vulkan and other ggml backends.
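The speedup factors quoted in the list can be re-derived from the before and after throughput figures. The short sketch below does that arithmetic for the three entries with exact numbers; the single-GPU entry is left out because its figures are rounded in the source. The function names are illustrative, not part of llama.cpp.

```python
# Token-generation throughput in tokens/second (before, after),
# copied from the benchmark list in this article.
benchmarks = {
    "Dual GPU (RTX 6000 Ada + RTX PRO 6000 Blackwell)": (87.35, 118.63),
    "NVIDIA DGX Spark (MXFP4 MoE)": (34.88, 45.93),
    "Apple M2 Ultra": (33.75, 43.78),
}


def speedup(before: float, after: float) -> float:
    """Ratio of new to old throughput; 1.36 means 36% more tokens per second."""
    return after / before


def percent_gain(before: float, after: float) -> float:
    """The same improvement expressed as a percentage increase."""
    return (after / before - 1.0) * 100.0


for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup(before, after):.2f}x "
          f"(+{percent_gain(before, after):.0f}%)")
```

Run as written, this reproduces the 1.36x, 1.32x, and 1.30x factors reported above.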

This addresses a known performance gap. Prior to the fix, llama.cpp was roughly 40% slower than vLLM on Qwen-Next models and ran at about one-third the speed of MLX on Apple Silicon, according to GitHub issues #19345 and #19366.
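"40% slower" and "1.36x faster" are not directly comparable percentages, so it helps to put both on the same scale. The sketch below is back-of-the-envelope arithmetic only: it assumes the gap figures from the issues and the speedups from the PR were measured on comparable hardware and workloads, and that vLLM and MLX performance stayed the same, none of which the source confirms.

```python
def speedup_to_match(slower_by: float) -> float:
    """Speedup needed to reach parity when running `slower_by` below a reference.

    slower_by=0.40 means running at 60% of the reference throughput.
    """
    return 1.0 / (1.0 - slower_by)


def relative_speed_after(slower_by: float, speedup: float) -> float:
    """Throughput relative to the reference (1.0 = parity) after a speedup."""
    return (1.0 - slower_by) * speedup


# vLLM comparison: 40% slower before, 1.36x dual-GPU speedup after.
print(f"needed for parity with vLLM: {speedup_to_match(0.40):.2f}x")
print(f"relative to vLLM after fix:  {relative_speed_after(0.40, 1.36):.2f}")

# MLX comparison: one-third the speed before, 1.30x M2 Ultra speedup after.
print(f"relative to MLX after fix:   {relative_speed_after(2 / 3, 1.30):.2f}")
```

Under those assumptions, llama.cpp would need about a 1.67x speedup to match vLLM outright; a 1.36x gain brings it to roughly 0.82 of vLLM's throughput, and a 1.30x gain on Apple Silicon brings it to roughly 0.43 of MLX's.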

Why It Matters

If you run Qwen models locally, this is a "stop what you are doing and update" moment. A 30-60% token generation improvement means the difference between a model feeling sluggish and feeling responsive during interactive use.
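To make "sluggish versus responsive" concrete, throughput can be converted into wall-clock time for a single reply. This is a rough sketch using the Apple M2 Ultra figures from the benchmarks; the 500-token reply length is a hypothetical value chosen for illustration, and prompt processing time is ignored.

```python
def seconds_for_reply(tokens: int, tokens_per_second: float) -> float:
    """Generation time for a reply of `tokens` tokens at a steady throughput."""
    return tokens / tokens_per_second


REPLY_TOKENS = 500  # hypothetical reply length, not taken from the benchmarks

# Apple M2 Ultra token-generation rates before and after the fix.
before = seconds_for_reply(REPLY_TOKENS, 33.75)
after = seconds_for_reply(REPLY_TOKENS, 43.78)

print(f"before: {before:.1f} s, after: {after:.1f} s, "
      f"saved: {before - after:.1f} s")
```

For that hypothetical reply, generation time drops from about 14.8 seconds to about 11.4 seconds.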

Qwen3.5 and Qwen3-Coder-Next have become popular choices for local deployment. The 35B-A3B mixture-of-experts variant of Qwen3.5 is particularly attractive because it runs well on consumer hardware while punching above its weight class. Qwen3-Coder-Next 80B is one of the strongest open coding models available. Both were being held back by a suboptimal compute graph in llama.cpp.

The cross-platform nature of this fix matters too. Whether you are on NVIDIA CUDA, Apple Metal, or Vulkan (for example, on AMD GPUs), you get the improvement. That is unusual: most llama.cpp performance patches target a single backend.

For anyone who switched to vLLM or MLX because of the speed gap, this closes much of the distance. Llama.cpp's advantage has always been its portability and quantization support. Now it is no longer paying a steep performance tax for Qwen architectures.

Our Take

This is how open-source inference should work. A performance gap gets flagged by users, documented in GitHub issues with real benchmarks, and fixed with a targeted PR. The turnaround from "llama.cpp is 40% slower than vLLM on Qwen-Next" to "here is a 36% speedup" is exactly what makes llama.cpp the backbone of local AI.

The practical takeaway: if you are running any Qwen3.5 or Qwen-Next model through llama.cpp, update immediately. Pull the latest build. The improvement is real and requires zero configuration changes on your end.

For those choosing between local inference backends, this reinforces llama.cpp's position as the default choice. It may not always be the fastest on day one for new architectures, but the community catches up quickly, and the breadth of hardware support is unmatched. vLLM remains better for high-throughput server deployments. MLX is still faster on Apple Silicon for supported models. But llama.cpp is the only option that works well everywhere, and with this fix, the gap on Qwen models is now much smaller.

One thing to watch: Qwen3.5's smaller MoE variants (like the 35B-A3B) should see similar improvements, making them even more viable on a single consumer GPU.