Gemma 4 Now Runs Stable on Llama.cpp After Key Bug Fixes


The Gemma 4 rough edges in llama.cpp are gone. A pull request merged into the codebase this week resolves all known stability problems with running Google's Gemma 4 models locally, so you can now run the 31B-parameter version without the crashes and garbled output that plagued earlier builds.

Llama.cpp is the open-source inference engine that lets you run large language models on your own hardware - no cloud subscription, no API costs. Q5 quantization is a compression technique that shrinks a model's file size and memory requirements by storing each weight at roughly 5 bits rather than the usual 16, trading a small quality hit for dramatically lower hardware demands. The 31B model at Q5 runs on a 24GB consumer GPU, or split across CPU and GPU on most modern workstations.
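
To see why a 24GB card is roughly the right size, here is a back-of-the-envelope sketch of the weight footprint. The 5.5 bits-per-weight figure is an assumption (Q5 variants store a little more than 5 bits per weight once scaling data is counted), and the function name is ours, not part of llama.cpp. The estimate covers weights only; the context cache and runtime buffers need additional room.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the model weights in decimal gigabytes.

    Ignores the KV cache, activations, and runtime overhead, so real
    memory use will be higher than this figure.
    """
    total_bits = n_params * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # 31B parameters; 5.5 bits per weight is an assumed average for Q5.
    q5 = estimate_weight_gb(31e9, 5.5)
    # The same model at 16 bits per weight, for comparison.
    full = estimate_weight_gb(31e9, 16)
    print(f"Q5 weights:     ~{q5:.1f} GB")
    print(f"16-bit weights: ~{full:.1f} GB")
```

Under those assumptions the Q5 weights come to about 21.3 GB, against 62 GB at 16 bits. That leaves only a few gigabytes of headroom on a 24GB card, which is why longer contexts may push some layers onto the CPU.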

One setup detail to get right: launch with the --chat-template-file flag pointing at the interleaved chat template, which lives in the models/templates directory inside the llama.cpp repo. Skip it and you'll get strange output even on an otherwise stable build. The template ships with the current codebase, so no external download is needed.
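
As a rough sketch of that launch, the snippet below assembles the command line and prints it. The model filename and the template filename are hypothetical placeholders, so substitute the actual file found in models/templates. Using llama-server as the binary, and passing --jinja ahead of the template flag, are assumptions about your setup rather than something this week's fix requires.

```python
import shlex


def build_server_command(model_path, template_path, binary="llama-server"):
    """Return the argv list for launching llama.cpp with a custom chat template."""
    return [
        binary,
        "-m", str(model_path),
        # Assumption: enable Jinja template handling before naming the file.
        "--jinja",
        "--chat-template-file", str(template_path),
    ]


if __name__ == "__main__":
    # Both paths are placeholders, not real filenames from the repo.
    cmd = build_server_command(
        "models/gemma-4-31b-q5.gguf",
        "models/templates/INTERLEAVED_TEMPLATE.jinja",
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; pass the list to subprocess.run once the paths point at real files.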

If you tried Gemma 4 locally a few weeks ago and bailed due to instability, this week's build is worth another look.