Unsloth Pushes Updated Gemma 4 GGUFs - Re-download Required

Running Gemma 4 locally? You need to re-download the GGUF files.

Unsloth, the team behind heavily optimized builds of open-source models, pushed updated GGUF versions of Google's Gemma 4 on April 8. GGUFs are quantized (compressed) model files designed to run on consumer hardware without a cloud connection. The two updated releases cover the 2B instruction-tuned model (E2B in the repo name) and the 26B MoE variant. MoE stands for mixture-of-experts: the model has 26 billion total parameters but activates only about 4 billion of them at a time during inference (the A4B in the repo name). That makes it faster and cheaper to run than a dense model with the same total parameter count.
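To make the "26 billion total, 4 billion active" idea concrete, here is a toy sketch of top-k expert routing in plain Python. It is a generic illustration of mixture-of-experts, not Gemma 4's actual router: the eight experts, the logits, and k=2 are made-up values, and route_top_k and active_fraction are hypothetical helper names.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def active_fraction(total_params_b, active_params_b):
    """Share of parameters that take part in a single forward step."""
    return active_params_b / total_params_b


# Made-up router scores for 8 hypothetical experts; only 2 run for this token.
weights = route_top_k([0.1, 2.0, -1.0, 1.5, 0.3, 0.0, -0.5, 0.9], k=2)
print(sorted(weights))                   # [1, 3]
print(f"{active_fraction(26, 4):.1%}")   # 15.4%
```

Only the selected experts do any work for a given token, which is why per-token compute tracks the active parameter count rather than the total.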

The update patches a KV-cache bug involving attention rotation for heterogeneous iSWA. In plain language: iSWA (interleaved sliding window attention) is a technique Gemma 4 uses to handle longer conversations more efficiently. Some layers attend only to a sliding window of recent tokens, interleaved with layers that attend to the full context. The KV-cache stores that conversation state so the model doesn't recompute it from scratch for every new token. The bug meant cached data could produce incorrect outputs when attention rotation was applied across the model's mixed-layer architecture.
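As a rough illustration of why a mixed-layer model complicates caching, the sketch below shows which positions a token can see in a sliding-window layer versus a full-attention layer, and how many cache entries each layer type needs. The layer pattern and the window size of 4 are invented for the example and are not Gemma 4's real configuration. The code does not reproduce the bug itself.

```python
def visible_positions(query_pos, window=None):
    """Positions a token at query_pos may attend to under causal attention.

    window=None means full attention; otherwise only the last `window` tokens.
    """
    start = 0 if window is None else max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


def cache_entries_needed(seq_len, window=None):
    """Number of key/value entries a layer must keep for a sequence of seq_len tokens."""
    return seq_len if window is None else min(seq_len, window)


# Hypothetical pattern: three sliding-window layers, then one full-attention layer.
LAYER_WINDOWS = [4, 4, 4, None]

for layer, window in enumerate(LAYER_WINDOWS):
    print(layer, visible_positions(9, window), cache_entries_needed(10, window))
```

Because the two layer types keep different amounts of history, any operation applied to cached entries has to be handled correctly for each type, which is the kind of place a cache bug can hide.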

The Unsloth team flagged this as broad enough to warrant a full re-release rather than a patch. If you grabbed these files in the first wave after Gemma 4's release, head back to Hugging Face and pull the latest versions from the unsloth/gemma-4-E2B-it-GGUF and unsloth/gemma-4-26B-A4B-it-GGUF repos.
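One way to refresh local copies is with the huggingface_hub Python package. The sketch below is a minimal example under stated assumptions: the "*Q4_K_M*" filename pattern and the "models" directory are guesses for illustration (check each repo's file list for the quantization you actually use), and download_kwargs and redownload_all are hypothetical helper names.

```python
REPOS = [
    "unsloth/gemma-4-E2B-it-GGUF",
    "unsloth/gemma-4-26B-A4B-it-GGUF",
]


def download_kwargs(repo_id, quant_pattern="*Q4_K_M*", local_root="models"):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        # Assumed filename pattern; adjust to the quantization you run.
        "allow_patterns": [quant_pattern],
        "local_dir": f"{local_root}/{repo_id.split('/')[-1]}",
        # Re-fetch even if a copy of the file is already cached locally.
        "force_download": True,
    }


def redownload_all():
    """Fetch the current files from both repos (needs network access)."""
    # Imported here so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    for repo_id in REPOS:
        path = snapshot_download(**download_kwargs(repo_id))
        print("refreshed", path)
```

Calling redownload_all() performs the actual download. If you load models through another tool that keeps its own cache, you may need to clear or refresh that cache separately.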

The 26B MoE variant is worth the re-download - it benchmarks competitively against much larger models while staying small enough for a mid-range GPU. The 2B is useful for edge deployments or cheap prompt testing before committing to larger runs.