
Qwen3 Multi-Token Prediction Tested at 1M Tokens, Shows 1.5x Speed Boost

May 15, 2026


1.5 times faster. That's what one developer measured when testing Qwen3's new multi-token prediction (MTP) build across three coding sessions totaling more than a million tokens.

Multi-token prediction is a technique in which the model generates several tokens (word fragments) per step rather than one at a time. Standard language models predict one token, then the next, sequentially. MTP eases that bottleneck at the output stage, which can increase generation speed without reducing the quality of the underlying model.
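To see why emitting more than one token per step matters, here is a toy sketch that only counts decoding steps. It assumes every step costs the same and that MTP keeps a hypothetical average of 1.5 tokens per step; the function name and numbers are illustrative and say nothing about how the Qwen3 build is actually implemented.

```python
import math


def decoding_steps(total_tokens: int, tokens_per_step: float) -> int:
    """Number of decoding steps needed to emit total_tokens.

    tokens_per_step is a hypothetical average: 1.0 models standard
    one-token-at-a-time decoding, values above 1.0 model MTP.
    """
    if tokens_per_step < 1.0:
        raise ValueError("tokens_per_step must be at least 1.0")
    return math.ceil(total_tokens / tokens_per_step)


baseline = decoding_steps(900, 1.0)  # one token per step
with_mtp = decoding_steps(900, 1.5)  # assumed average of 1.5 tokens per step
speedup = baseline / with_mtp        # ratio of step counts only

print(f"{baseline} steps vs {with_mtp} steps, ratio {speedup:.2f}x")
```

Real steps are unlikely to cost exactly the same with and without MTP, so the ratio of step counts is best read as an upper bound on the wall-clock gain.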

The test project was a mystery dungeon game built iteratively in Python using the pygame library, a practical, open-ended coding task that required the model to track decisions made hundreds of exchanges earlier. The context window was extended to 300,000 tokens (roughly the text of several novels held in memory simultaneously) to maintain consistency across a growing codebase. The tester ran Q4_0 quantization, a compression method that reduces memory requirements at some cost in quality, and noted afterward that they had accidentally left it at a lower setting than intended. A follow-up test at Q8_0, which preserves more model quality, is planned.
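As a rough guide to what the two quantization levels mean for memory, the sketch below estimates the size of the weights alone. It assumes the Q4_0 and Q8_0 block layouts used by llama.cpp's GGUF files (32 weights plus one 16-bit scale per block) and a hypothetical 35B-parameter model; the article does not state the exact runtime or model size used in the test.

```python
# Bits per weight, assuming the llama.cpp/GGUF block layout:
# each block stores 32 quantized weights plus one 16-bit scale.
BITS_PER_WEIGHT = {
    "Q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per weight
    "Q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
}


def weight_memory_gb(n_params: float, quant: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


N_PARAMS = 35e9  # assumption: a 35B-parameter model, for illustration only

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weight_memory_gb(N_PARAMS, quant):.1f} GB of weights")
```

These figures leave out the memory for a 300,000-token context and any tensors stored at a different precision, so actual usage would be higher.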

For people running AI locally on consumer hardware, generation speed is the constant trade-off against using a hosted service. A hosted service like ChatGPT responds in seconds regardless of your machine. A locally hosted 35B-parameter model may generate only 10-20 tokens per second depending on your GPU, making longer interactions feel sluggish. A consistent 1.5x improvement from MTP would raise that to roughly 15-30 tokens per second, pushing local generation from "tolerable" toward usable in a real workflow without requiring hardware upgrades.
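The arithmetic behind that claim can be worked through for a single reply. The 1,200-token response length below is a hypothetical figure chosen for illustration; the 10-20 tokens per second range and the 1.5x factor come from the article.

```python
def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to stream n_tokens at a steady generation rate."""
    return n_tokens / tokens_per_second


RESPONSE_TOKENS = 1_200  # hypothetical length of one long code reply
SPEEDUP = 1.5            # the figure reported by the tester

for base_rate in (10.0, 20.0):
    boosted_rate = base_rate * SPEEDUP
    before = seconds_to_generate(RESPONSE_TOKENS, base_rate)
    after = seconds_to_generate(RESPONSE_TOKENS, boosted_rate)
    print(f"{base_rate:.0f} -> {boosted_rate:.0f} tok/s: "
          f"{before:.0f}s -> {after:.0f}s per reply")
```

At the slow end of the range, that is the difference between waiting two minutes and waiting 80 seconds for the same reply.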

These results come from community testing on a single-machine setup, not a controlled benchmark. Performance will vary with GPU, available RAM, and quantization settings. The Q8_0 retest should clarify how much of the speed gain comes from the MTP technique itself and how much from the reduced memory pressure of the lower quantization setting. The MTP builds for Qwen3 are available now for local deployment.
