Google's Gemma 4 Hits Compatibility Wall with Popular Local AI Tools

Running AI models on your own hardware instead of paying for cloud API calls has become a serious hobby - and increasingly, a serious workflow. Two tools sit at the center of that local AI stack: Unsloth, which makes fine-tuning models (customizing them with your own data) faster and more memory-efficient, and llama.cpp, the C++ inference engine that runs large language models on consumer hardware, usually after they have been compressed through a process called quantization.

Gemma 4, Google's latest open-weight model, apparently doesn't play nice with either.

The Problem

Multiple users are reporting that Gemma 4 produces garbled or broken output when processed through the Unsloth-to-llama.cpp pipeline that works reliably for most other models. The issue appears to surface during quantization - the step where a model's full-precision weights get compressed into smaller formats (like 4-bit or 8-bit) so they can fit on a consumer GPU with 8-24GB of VRAM instead of requiring enterprise hardware.
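To make the quantization step concrete, here is a minimal Python sketch of the general idea: floating-point weights are mapped to small integers plus a shared scale factor, then mapped back at inference time. This is a toy "absmax" scheme for illustration only. It is not the block-wise format llama.cpp actually uses, and the function names and sample weights are invented for the example.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers that fit in `bits` bits, plus one scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


def worst_error(weights, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    quantized, scale = quantize_absmax(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


if __name__ == "__main__":
    sample = [0.12, -0.53, 0.98, -0.07]   # made-up weights
    for bits in (8, 4):
        print(f"{bits}-bit worst round-trip error: {worst_error(sample, bits):.4f}")
```

Fewer bits mean smaller files but a coarser grid, so the round-trip error grows. If a tool misreads which tensor is which, or how it is laid out, the same arithmetic gets applied to the wrong numbers, which is one plausible way to end up with garbled output.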

This is a familiar pattern. New model architectures sometimes change attention mechanisms, tokenizers, or tensor layouts in ways that existing tools haven't been updated to handle. Both llama.cpp and Unsloth need to add explicit support for each model architecture, so when a model ships with undocumented structural changes, things break.
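The "explicit support per architecture" point can be pictured as a lookup table of converters. The sketch below is purely hypothetical: the architecture names, tensor names, and functions are invented for illustration and do not reflect how llama.cpp or Unsloth are actually organized.

```python
class UnsupportedArchitectureError(Exception):
    """Raised when no converter is registered for an architecture."""


def convert_example_v1(tensors):
    # Hypothetical layout: v1 stores fused attention weights under "attn.qkv".
    return {"attention": tensors["attn.qkv"]}


# Hypothetical registry: one entry per architecture the tool knows about.
CONVERTERS = {
    "example-arch-v1": convert_example_v1,
}


def convert_model(architecture, tensors):
    """Dispatch to the converter registered for this architecture."""
    converter = CONVERTERS.get(architecture)
    if converter is None:
        raise UnsupportedArchitectureError(
            f"no converter registered for {architecture!r}"
        )
    return converter(tensors)
```

An unknown architecture fails loudly here, which is the easy case. The harder case is a new model that reuses a familiar name while quietly changing its layout: the old converter still runs, but on tensors it no longer understands.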

Who This Affects

Anyone planning to run Gemma 4 locally should hold off on quantized versions until the toolchain catches up. The full-precision model may still work through other inference engines like vLLM or Hugging Face Transformers, but those require significantly more VRAM - typically 40GB+ for a model this size.
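The VRAM gap between full-precision and quantized weights is simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate that counts weights only and ignores the KV cache, activations, and runtime overhead. The parameter counts are illustrative placeholders, not Gemma 4's actual sizes.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9           # bits -> bytes -> GB


if __name__ == "__main__":
    # Illustrative model sizes in billions of parameters (assumptions).
    for params in (7, 20):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(params, bits)
            print(f"{params}B params at {bits}-bit: about {gb:.1f} GB")
```

Under these assumptions, dropping from 16-bit to 4-bit cuts the weight footprint to a quarter, which is the difference between needing datacenter hardware and fitting on a single consumer card.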

This is particularly frustrating because Google positions the Gemma family as its open-weight play against Meta's Llama models. Llama 3 and its variants work almost seamlessly with the Unsloth/llama.cpp stack. If Gemma 4 can't match that compatibility, many local AI enthusiasts will simply stick with Llama.

The Unsloth and llama.cpp maintainers tend to move fast on compatibility fixes for popular models, so this will likely get resolved within days to weeks. But for now, if you're in the local LLM space and eyeing Gemma 4, wait for updated GGUF quantizations (GGUF is the file format llama.cpp uses) before downloading anything.
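If you do end up with a downloaded file and want a quick sanity check, a GGUF file starts with the four ASCII bytes "GGUF" followed by a 32-bit version number. The Python sketch below reads just that header. It assumes the common little-endian layout, and the helper name is made up for this example. A valid header only shows the file is GGUF; it says nothing about whether the quantization inside was produced by a fixed toolchain.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version from the file header, or raise ValueError."""
    with open(path, "rb") as f:
        header = f.read(8)                # 4 magic bytes + 4 version bytes
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumes a little-endian unsigned 32-bit version field.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("model.gguf")` on a file of your own (the filename here is a placeholder) returns the format version or raises an error for anything that isn't GGUF.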