Unsloth Releases Multi-Token Prediction GGUF Weights for Gemma 4

Unsloth has published Multi-Token Prediction (MTP) GGUF weights for three sizes of Google's Gemma 4: the 31B, 26B-A4B, and 12B instruction-tuned variants. The files are available on Hugging Face at Q8, F16, and BF16 precision.

GGUF is the file format used by llama.cpp and compatible tools for running large language models on your own hardware, without a cloud API. MTP is a technique in which the model predicts several future tokens (the sub-word chunks of text it reads and writes) in one step instead of one at a time, which speeds up generation without requiring a hardware upgrade. The 26B-A4B is a mixture-of-experts variant: it has 26 billion total parameters but activates only roughly 4 billion for any given token, making it considerably cheaper in compute per token than a dense 26B model, although all 26 billion parameters still have to be loaded.
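To make the speed argument concrete, here is a back-of-the-envelope sketch. It assumes the extra predicted tokens are treated as drafts that the main model then checks, as in speculative decoding, and that each drafted token is accepted independently with the same probability. The function name, the draft length, and the acceptance rates are illustrative assumptions, not measurements of Gemma 4 or of Unsloth's files.

```python
def expected_tokens_per_step(draft_len: int, accept_rate: float) -> float:
    """Expected tokens emitted per verification step (toy model).

    Assumptions (hypothetical, for illustration only):
    - draft_len tokens are drafted ahead of the main model,
    - each draft is accepted independently with probability accept_rate,
    - acceptance stops at the first rejected draft,
    - the verifying pass always contributes one token of its own.
    """
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    # 1 + a + a^2 + ... + a^draft_len
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_step(draft_len=3, accept_rate=rate)
        print(f"accept_rate={rate}: about {tokens:.2f} tokens per step")
```

Plain one-at-a-time decoding corresponds to `draft_len=0`, which returns exactly 1. The figure is an upper bound on any speedup, since producing and checking the drafts has a cost of its own.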

The three precision options trade memory against quality. Q8 (8-bit quantization, where each model weight is stored at reduced precision, cutting memory use roughly in half compared with 16-bit) is the practical pick for most local setups. F16 and BF16 store weights at full 16-bit precision with near-zero quality loss, but the hardware requirements jump substantially: the 31B at F16 needs around 60GB of VRAM or unified memory for the weights alone, which rules out most consumer GPUs. The Q8 version of the 12B fits comfortably in 16GB. Unsloth's MTP additions are a straightforward quality-of-life improvement for anyone already running Gemma 4 locally and wanting faster output.
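The memory figures above follow from simple arithmetic: parameter count times bytes per weight. The sketch below is a rough estimate under stated assumptions. It takes the nominal sizes from the model names as exact parameter counts, counts the weights only (no KV cache, activations, or MTP overhead), and assumes "Q8" costs about 8.5 bits per weight, as in GGUF's Q8_0 block layout. The helper name is hypothetical.

```python
# Bits stored per weight. 16 is exact for F16 and BF16.
# 8.5 for Q8 is an assumption: it matches GGUF's Q8_0 layout
# (32 one-byte weights plus one 2-byte scale per block).
BITS_PER_WEIGHT = {"F16": 16.0, "BF16": 16.0, "Q8": 8.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bits = BITS_PER_WEIGHT[precision]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits / 8


if __name__ == "__main__":
    # Nominal sizes taken from the model names; real counts may differ.
    for size in (31, 26, 12):
        for precision in ("Q8", "F16"):
            gb = weight_memory_gb(size, precision)
            print(f"{size}B at {precision}: about {gb:.1f} GB of weights")
```

Under these assumptions the 31B at F16 comes to about 62 GB and the 12B at Q8 to just under 13 GB, consistent with the figures quoted above. Actual file sizes and runtime memory use will differ somewhat.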