
LLaMA.cpp Gets Multi-Token Prediction: 40% Faster Gemma 4 Generation

May 8, 2026 2 min read


40% faster. That's the speedup people running AI models locally are reporting with a new Multi-Token Prediction (MTP) implementation for LLaMA.cpp when running Gemma 4.

LLaMA.cpp is the open-source inference engine most people use to run AI models on their own hardware - laptops, desktops, local servers - without sending data to any cloud service. MTP is a technique where the model predicts several output tokens (the basic text units a model generates) per step instead of one at a time. Since generating one token per step is the main bottleneck in local inference speed, predicting 2-4 tokens per step means fewer steps per response and noticeably less waiting.
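As a rough sketch of the arithmetic behind that claim (this is not LLaMA.cpp's code, and the function names and example figures are hypothetical), the snippet below counts how many generation steps a reply needs at different tokens-per-step values, then expresses throughput relative to ordinary one-token-per-step decoding. It assumes a simplified model in which throughput is the average number of tokens kept per step divided by the relative cost of a step.

```python
import math


def forward_passes(total_tokens: int, tokens_per_step: int) -> int:
    """Steps needed to emit total_tokens if every step yields tokens_per_step tokens."""
    if total_tokens < 0 or tokens_per_step < 1:
        raise ValueError("total_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(total_tokens / tokens_per_step)


def relative_throughput(avg_tokens_per_step: float, relative_step_cost: float = 1.0) -> float:
    """Throughput relative to one-token-per-step decoding (simplified model).

    avg_tokens_per_step: tokens actually kept per step on average (hypothetical input).
    relative_step_cost: time of one multi-token step divided by the time of a
        single-token step (hypothetical input; 1.0 means no extra cost).
    """
    if avg_tokens_per_step <= 0 or relative_step_cost <= 0:
        raise ValueError("both arguments must be positive")
    return avg_tokens_per_step / relative_step_cost


# Illustrative 600-token reply at 1 to 4 tokens per step.
for k in (1, 2, 3, 4):
    print(f"{k} token(s) per step -> {forward_passes(600, k)} steps")
```

In this simplified model, a 40% speedup corresponds to a relative throughput of 1.4, which is well below the 2-4x that a flat 2-4 tokens per step would suggest. One plausible reading is that not every predicted token ends up being kept and that a multi-token step costs more than a single-token one, but the article does not break the figure down.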

The 40% figure applies specifically to Gemma 4, Google's open-weights model released in April 2026. Other models will see different gains depending on how they were trained - MTP only delivers its full benefit when a model was trained with multi-token prediction in mind. Gemma 4 was, which explains why the speedup is this pronounced.

For anyone running local models for coding, writing, or document work, a gain of this size changes how the tool feels day to day. A model that was borderline too slow for interactive use might now feel responsive enough to actually rely on. The change requires no new hardware or model files - it's a software-level optimization inside the inference engine itself.

This is a community contribution to LLaMA.cpp rather than an official release from any AI lab, which is how most meaningful progress in local AI happens. No announcement, no press release - just working code shipping to the repo.
