Google's Gemma 4 Open Models Pack Multimodal Smarts Into Laptop-Sized Packages

Google just shipped Gemma 4, and the standout model is the 26B A4B variant: 26 billion total parameters, but only 4 billion active for any given token. That mixture-of-experts design (where a router sends each token to a few specialized "expert" sub-networks, and only those fire) means it runs at roughly the same speed as models a fraction of its size while punching well above its weight on quality.
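
To make the mixture-of-experts idea concrete, here is a toy sketch of top-k routing in plain Python. It is not Gemma 4's actual architecture: the eight scalar "experts", the fixed router scores, and the choice of k=2 are all assumptions made for illustration. In a real model the router scores come from a learned layer applied to each token's hidden state, and each expert is a full feed-forward block.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    top = order[:k]
    # Renormalize over the selected experts only, so the weights sum to 1.
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    selected = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in selected)
    return output, [i for i, _ in selected]

# Hypothetical stand-ins: expert n simply multiplies its input by n + 1.
experts = [lambda x, scale=n + 1: scale * x for n in range(8)]

# Assumed router scores for one token; a real router computes these per token.
router_scores = [0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.9]

output, used = moe_layer(3.0, experts, router_scores, k=2)
print(f"experts used: {used} ({len(used)} of {len(experts)})")
```

Only the two selected experts are ever called, which is where the speed advantage comes from: compute per token scales with the active experts rather than the full parameter count.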

Early local testing shows the 26B A4B running at about 60 tokens per second on an Apple Mac Studio with an M1 Ultra chip, comparable to Qwen 3.5's 35B model at similar context lengths. But users report noticeably better behavior: more concise reasoning chains, less repetitive looping, and stronger visual understanding. One tester described Qwen's chain-of-thought as "inner gaslighting" compared to Gemma's more coherent step-by-step output.

The Full Lineup

Four models ship under the Gemma 4 banner:

  • E2B and E4B (2B and 4B effective parameters) - Built for phones, edge devices, and browsers. They use a technique called Per-Layer Embeddings, where each decoder layer gets its own small embedding for every token, squeezing more capability out of tiny footprints.
  • 31B - A dense model (all parameters active all the time) for server deployments or beefy local machines. Needs about 58 GB of memory at full precision.
  • 26B A4B - The mixture-of-experts model. Needs about 48 GB at full precision, but quantized versions (compressed to use less memory at a slight quality cost) fit in about 15.6 GB.
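
The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on assumptions the article does not state: that "full precision" means 16 bits per weight, that the parameter counts are exactly 31 and 26 billion, and that the sizes are reported in binary gigabytes (GiB). It counts weights only and ignores the KV cache and runtime overhead. Under those assumptions the results land close to the quoted 58 GB and 48 GB, and the 15.6 GB quantized size works out to a little over 5 bits per weight.

```python
GIB = 1024 ** 3  # bytes in one binary gigabyte

def weight_memory_gib(params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB

def implied_bits_per_weight(params_billion, size_gib):
    """Average bits per weight implied by a given file size."""
    return size_gib * GIB * 8 / (params_billion * 1e9)

# Assumption: "full precision" in the article means 16-bit weights.
for name, params in [("31B", 31), ("26B A4B", 26)]:
    print(f"{name}: ~{weight_memory_gib(params, 16):.1f} GiB at 16 bits per weight")

bits = implied_bits_per_weight(26, 15.6)
print(f"26B A4B at 15.6 GiB implies ~{bits:.1f} bits per weight")
```

Note that the mixture-of-experts model still has to hold all 26 billion parameters in memory, even though only 4 billion are active per token. Sparsity saves compute, not storage.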

All four handle text, images at variable resolutions, video, and audio. Context windows range from 128K tokens for the small models to 256K for the larger ones (roughly 300 to 600 pages of text). All support function calling, where the model can invoke external tools and APIs as part of its response.
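
On the application side, function calling usually comes down to parsing a structured request from the model and running the matching function. The sketch below shows that dispatch step only. The JSON shape, the `get_weather` tool, and the sample model output are all hypothetical; Gemma 4's actual tool-call format is not described here and may differ.

```python
import json

def get_weather(city):
    """Hypothetical tool; a real one would call an external weather API."""
    return {"city": city, "forecast": "sunny"}

# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(model_output):
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Assumed model output; real models emit their own tool-call format.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'

result = dispatch_tool_call(model_output)
print(json.dumps(result))
```

In a full loop, the result would be sent back to the model as a new message so it can compose the final answer. Rejecting unknown tool names, as above, keeps the model from invoking anything outside the registry.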

Who Should Care

The 26B A4B is the interesting one for most people running models locally. At 15.6 GB quantized, it fits on a high-end laptop or a Mac with 32 GB of unified memory. The multimodal support means a single local model can handle text conversations, image analysis, and even basic video understanding without sending data to an external API.

For developers building applications, the built-in function calling and system prompt support mean less custom scaffolding. And the 256K context window is large enough for most document analysis tasks.

Google has not published official benchmark comparisons against Qwen, Llama, or Mistral yet. Independent benchmarks from sources like Artificial Analysis are still pending. But the early hands-on reports are consistently positive, particularly around reasoning quality and visual tasks. If you have been waiting for an open multimodal model that is genuinely practical to run locally, Gemma 4's 26B variant is worth testing.