Gemma 4 26B Runs at 7 Tokens/Sec on a $150 CPU-Only Desktop

$150. That's the approximate cost of the hardware running this particular test - a used desktop with an Intel i5-8500 processor from 2018, 32GB of RAM, and no dedicated GPU. On it: Google's Gemma 4 26B model, running via KoboldCpp on Linux at around 7 tokens per second.

Seven tokens per second works out to roughly five English words per second, about as fast as most people read and well beyond typing speed. It's usable for real work: drafting text, answering questions, coding help. And the model producing that output isn't a cut-down version of something larger - Gemma 4 26B is a competitive general-purpose model that benchmarks well against models that normally require dedicated AI hardware costing hundreds or thousands of dollars.

Why an Old CPU Can Handle a 26B Model

The answer is in the model's full name, which carries an "A4B" suffix. The designation means roughly 4 billion active parameters: this is a Mixture-of-Experts (MoE) architecture. Instead of routing every piece of computation through all 26 billion parameters each time it generates a token, the model splits its knowledge across dozens of specialized sub-networks (the "experts"). For any given token, only about 4 billion parameters activate.

The result is a model that holds 26B parameters' worth of knowledge in memory but does roughly the per-token computation of a 4B dense model. Your CPU only has to read and multiply about 4 billion weights per token, not 26 billion. That's why it runs at speeds that would be out of reach for a true 26B dense model on the same hardware.
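
A toy sketch of that routing idea, in Python, may make it concrete. Everything in it is illustrative: the eight experts, the two active per token, the random gate scores and the function names are assumptions made for the example, not Gemma's actual configuration or code. In a real MoE layer the gate scores come from a small learned router and each expert is a feed-forward network rather than a one-line function.

```python
import math
import random


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_scores, top_k):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(x, experts, gate_scores, top_k):
    """Evaluate only the selected experts and blend their outputs."""
    selected = route_token(gate_scores, top_k)
    output = sum(weight * experts[i](x) for i, weight in selected)
    return output, [i for i, _ in selected]


# Hypothetical toy setup: 8 stand-in experts, 2 active per token.
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

# Stand-in for a learned router: fixed-seed random scores for one token.
rng = random.Random(0)
gate_scores = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]

output, used = moe_layer(1.0, experts, gate_scores, TOP_K)
print(f"experts evaluated for this token: {len(used)} of {NUM_EXPERTS}")
```

All eight experts have to exist in memory, because the next token may pick a different pair, but only two of them do any work for this one. That is the same trade the article describes at full scale.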

Running it does require all 26B parameters loaded in RAM at once. At 4-bit quantization - a compression technique that reduces precision slightly to cut file size roughly fourfold compared with 16-bit weights - that's around 13-14GB of memory, comfortably within 32GB. The computation is light; you just need the memory capacity.
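
The arithmetic behind those figures is short enough to check by hand. The sketch below is a back-of-envelope estimate, not a benchmark. It assumes an idealised 4 bits per weight (real 4-bit formats store some extra scaling data, which is likely where the gap between 13GB and 14GB comes from) and assumes a peak memory bandwidth of 41.6 GB/s, a figure typical of dual-channel DDR4-2666. The source doesn't state the test machine's memory configuration, so treat that number as a placeholder.

```python
def weight_bytes(params, bits_per_weight):
    """Bytes needed to store `params` weights at the given precision."""
    return params * bits_per_weight / 8


GB = 1e9
TOTAL_PARAMS = 26e9   # every expert must sit in RAM
ACTIVE_PARAMS = 4e9   # weights actually touched per token
BITS = 4              # idealised 4-bit quantization (assumption)

resident_gb = weight_bytes(TOTAL_PARAMS, BITS) / GB    # RAM footprint
fp16_gb = weight_bytes(TOTAL_PARAMS, 16) / GB          # unquantized size
per_token_gb = weight_bytes(ACTIVE_PARAMS, BITS) / GB  # read per token

# ASSUMPTION: peak bandwidth of dual-channel DDR4-2666, not a measured value.
ASSUMED_BANDWIDTH_GB_S = 41.6

# If generation is limited by how fast weights stream from RAM, the ceiling
# is bandwidth divided by the bytes read for each token.
moe_ceiling = ASSUMED_BANDWIDTH_GB_S / per_token_gb
dense_ceiling = ASSUMED_BANDWIDTH_GB_S / resident_gb

print(f"RAM for weights at 4-bit: {resident_gb:.1f} GB (vs {fp16_gb:.1f} GB at 16-bit)")
print(f"weights read per token:   {per_token_gb:.1f} GB")
print(f"rough ceiling, MoE:       {moe_ceiling:.1f} tokens/s")
print(f"rough ceiling, dense 26B: {dense_ceiling:.1f} tokens/s")
```

Under these assumptions the MoE ceiling lands around 20 tokens per second and the dense one around 3, so a measured 7 tokens per second is plausible for the first and out of reach for the second. Real throughput falls short of the ceiling because of prompt processing, cache misses and compute overhead.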

What Changes for People Running Local Models

The GPU-free local AI story has been improving steadily, but most CPU-runnable models had a practical ceiling around 12B parameters before speeds became too slow for daily use. Models in that range handle plenty of tasks but sometimes struggle with complex reasoning or wide knowledge recall.

Gemma 4 26B lands above that ceiling. At 7 tokens/second, it's generating output faster than most people read, and the quality reflects a 26B-class model. The model is available through Google's model hub, and KoboldCpp handles it with no special configuration on Linux.

This won't replace GPU setups for heavy workloads - batch processing hundreds of documents at 7 tokens/second is genuinely slow. But for a personal assistant running locally, offline, on hardware you already own, it's a real option that didn't exist at this quality level six months ago.
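
To put a number on "slow", here is a rough estimate in Python. The batch size and output length are made-up values for illustration, and the calculation counts only generated tokens; time spent reading each document's prompt would come on top.

```python
def batch_hours(num_docs, output_tokens_per_doc, tokens_per_second):
    """Hours needed to generate output for a batch, ignoring prompt processing."""
    total_tokens = num_docs * output_tokens_per_doc
    return total_tokens / tokens_per_second / 3600


# Hypothetical job: 300 documents, a 500-token summary for each.
hours = batch_hours(num_docs=300, output_tokens_per_doc=500, tokens_per_second=7)
print(f"{hours:.1f} hours of generation")
```

That comes to about six hours for a job of this size, which is fine for an overnight run and painful for anything interactive.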