Qwen3.5-35B Hits 115 Tokens/Sec on NVIDIA GB10 With New Atlas Image

What Happened

A team in the LocalLLaMA community released an optimized Atlas image that runs Qwen3.5-35B on NVIDIA's GB10 DGX Spark at approximately 115 tokens per second. The image was shared on Reddit on March 7, 2026, following an earlier post that generated significant community interest.

The reported numbers are striking. According to the developers, prefill latency is low, and time per output token (TPOT) with multi-token prediction (MTP) is short enough that the output generates faster than you can read it. The team reports these speeds on a single GB10 unit running a 35-billion-parameter model.
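To put the headline figure in per-token terms, here is a minimal Python sketch that converts a throughput number into an average TPOT and a rough reading-speed comparison. Only the 115 tokens per second figure comes from the post. The function names and the 0.75 words-per-token ratio are illustrative assumptions, not measurements from the Atlas image. The result is an effective average: with MTP, tokens can arrive in bursts rather than one at a time, and prefill time is ignored.

```python
def tpot_ms(tokens_per_second: float) -> float:
    """Average time per output token, in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 / tokens_per_second


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Decode time for num_tokens at a steady rate (prefill not included)."""
    return num_tokens / tokens_per_second


def words_per_minute(tokens_per_second: float, words_per_token: float = 0.75) -> float:
    """Rough text rate in words per minute.

    words_per_token=0.75 is an assumed rule of thumb for English text;
    the real ratio depends on the tokenizer and the content.
    """
    return tokens_per_second * words_per_token * 60.0


reported_rate = 115.0  # tokens/sec, as reported in the post

print(round(tpot_ms(reported_rate), 1))                   # 8.7 (ms per token)
print(round(generation_seconds(1000, reported_rate), 1))  # 8.7 (seconds for 1000 tokens)
print(words_per_minute(reported_rate))                    # 5175.0 (words per minute)
```

Under those assumptions the stream runs at several thousand words per minute, which is consistent with the developers' remark that you cannot keep up with the output as it generates.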

The project grew from community response to the initial post. The developers noted hardware offers, people showing up with 4-node clusters ready to test, and enough engagement to keep pushing the optimization work forward. This is a community-driven effort to squeeze maximum performance out of NVIDIA's desktop-class AI hardware.

Why It Matters

Running a 35B-parameter model at 115 tokens per second on a single desktop unit changes the math on local inference. That speed is comparable to what cloud APIs deliver, but with no per-token costs, no rate limits, and data that never leaves the machine.

For practitioners who process sensitive documents, run batch workloads, or simply want predictable costs, local inference at this speed makes cloud APIs optional rather than necessary. A 35B model like Qwen3.5 is large enough to handle most productivity tasks (summarization, code generation, analysis, drafting) without the quality compromises that come with smaller models.

The GB10 DGX Spark is not cheap hardware, but it is desktop hardware. This is not a rack-mounted server in a data center. It sits on a desk and runs models that required cloud infrastructure a year ago.

Our Take

The local LLM space keeps closing the gap with cloud APIs, and this benchmark is one of the more convincing data points. 115 tokens per second on a 35B model is not just fast for local inference; it is fast, period. Most cloud endpoints do not consistently beat that.

What makes this interesting is the combination of model quality and speed. Qwen3.5-35B is a capable model. Running it at speeds where the bottleneck is human reading speed rather than inference speed means local deployment is no longer a compromise. It is a legitimate alternative to API calls for many use cases.

The caveat is hardware cost. The DGX Spark is a purpose-built AI workstation, not a consumer GPU you slot into an existing PC. For individual users, cloud APIs are still more accessible. But for teams or companies processing high volumes of text, the break-even point where owning the hardware beats paying per token is getting closer.
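The break-even idea can be made concrete with a small Python sketch. Every input in the example is a hypothetical placeholder (hardware price, API price per million tokens, utilization), chosen only to show the arithmetic; none of these values comes from the post or from any vendor price list. The only sourced number is the 115 tokens per second rate. The model is deliberately simple: one stream at a steady rate, with electricity folded into an optional per-million-token cost.

```python
SECONDS_PER_DAY = 86_400


def break_even_tokens(
    hardware_cost_usd: float,
    api_price_per_million_usd: float,
    local_cost_per_million_usd: float = 0.0,
) -> float:
    """Tokens you must generate before owning the hardware beats paying per token."""
    saving_per_million = api_price_per_million_usd - local_cost_per_million_usd
    if saving_per_million <= 0:
        raise ValueError("local cost per token must be lower than the API price")
    return hardware_cost_usd / saving_per_million * 1_000_000


def days_to_break_even(
    tokens: float, tokens_per_second: float, utilization: float
) -> float:
    """Days needed to generate `tokens` at a given rate and duty cycle (0 to 1)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY * utilization
    return tokens / tokens_per_day


# Hypothetical placeholder inputs, for illustration only.
hardware_cost = 4_000.0  # USD, assumed
api_price = 1.0          # USD per million output tokens, assumed
utilization = 0.5        # machine busy half the time, assumed

tokens_needed = break_even_tokens(hardware_cost, api_price)
days = days_to_break_even(tokens_needed, 115.0, utilization)
print(f"{tokens_needed:.3g} tokens, about {days:.0f} days at 115 tokens/sec")
```

Swapping in real prices and a realistic duty cycle shows why the break-even point favors teams with sustained volume: at low utilization the number of days grows quickly, which matches the point that cloud APIs remain more accessible for individual users.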

The community-driven nature of this work is also worth noting. NVIDIA builds the hardware, but the community is building the optimized images and pushing the performance envelope. That ecosystem energy around local inference is not slowing down.