LLaMA.cpp Gets Multi-Token Prediction: 40% Faster Gemma 4 Generation

40% faster. That's the speedup people running models locally are seeing with a new Multi-Token Prediction (MTP) implementation for LLaMA.cpp when running Gemma 4.

LLaMA.cpp is the open-source inference engine most people use to run AI models on their own hardware - laptops, desktops, local servers - without sending data to any cloud service. MTP is a technique where the model predicts several output tokens (the basic text units a model generates) simultaneously instead of one at a time. Since generating one token per step is the main bottleneck in local AI speed, predicting 2-4 tokens per step cuts the waiting time significantly.
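As a rough illustration of why emitting more tokens per step helps, the Python sketch below counts how many model steps a response needs. It assumes every token predicted in a step is kept, which real implementations do not guarantee. The function names and the `step_cost_ratio` parameter are hypothetical and are not part of LLaMA.cpp.

```python
import math


def decoding_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Model steps needed to emit num_tokens at tokens_per_step per step.

    Idealized assumption: every predicted token is kept.
    """
    if num_tokens < 0 or tokens_per_step < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(num_tokens / tokens_per_step)


def ideal_speedup(avg_tokens_per_step: float, step_cost_ratio: float = 1.0) -> float:
    """Throughput gain relative to one-token-per-step decoding.

    step_cost_ratio (hypothetical parameter) is how much slower one
    multi-token step is than one single-token step; 1.0 means no extra cost.
    """
    if avg_tokens_per_step <= 0 or step_cost_ratio <= 0:
        raise ValueError("both arguments must be positive")
    return avg_tokens_per_step / step_cost_ratio


if __name__ == "__main__":
    # Compare step counts for a 120-token reply at 1 to 4 tokens per step.
    for k in (1, 2, 3, 4):
        print(k, decoding_steps(120, k))
```

In this simplified model the gain is the average number of tokens kept per step divided by the extra cost of each step, so a real-world figure can land well below the raw 2-4x that the token count alone would suggest.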

The 40% figure applies specifically to Gemma 4, Google's open-weights model released in April 2026. Other models will see different gains depending on how they were trained - MTP only delivers its full benefit when a model was trained with multi-token prediction in mind. Gemma 4 was, which explains why the speedup is this pronounced.

For anyone running local models for coding, writing, or document work, this kind of gain changes the day-to-day feel. A model that was borderline too slow for interactive use might now feel responsive enough to actually rely on. The change requires no new hardware and no different model files - it's a software-level optimization inside the inference engine itself.
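To put the figure in wall-clock terms, here is a small back-of-the-envelope calculation in Python. It assumes "40% faster" means 1.4x the tokens per second. The 10 tokens-per-second baseline and the 500-token response are made-up numbers used only for illustration.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady tokens_per_second rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def time_saved_fraction(speedup: float) -> float:
    """Fraction of waiting time removed by a given throughput speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    baseline_tps = 10.0              # hypothetical baseline speed
    faster_tps = baseline_tps * 1.4  # assumed meaning of "40% faster"
    response_tokens = 500            # hypothetical response length

    before = generation_seconds(response_tokens, baseline_tps)
    after = generation_seconds(response_tokens, faster_tps)
    print(f"before: {before:.1f} s, after: {after:.1f} s")
    print(f"waiting time saved: {time_saved_fraction(1.4):.0%}")
```

Under these assumptions, a 1.4x throughput gain removes roughly 29% of the waiting time rather than 40%, because time scales with the inverse of speed.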

This is a community contribution to LLaMA.cpp rather than an official release from any AI lab, which is how most meaningful progress in local AI happens. No announcement, no press release - just working code shipping to the repo.