Gemma 4 26B Hits 80-110 Tokens Per Second on a Single RTX 3090, But Tool Calling Breaks

AI news: Gemma 4 26B Hits 80-110 Tokens Per Second on a Single RTX 3090, But Tool Calling Breaks

80 to 110 tokens per second on a single RTX 3090. Tokens are chunks of text - roughly 75% of a word each - so 100 tokens/second translates to around 75 words of output per second. That's the generation speed some users are hitting with Google's Gemma 4 26B A3B model running locally in LM Studio.

The A3B designation signals a mixture-of-experts architecture, where only a portion of the model's 26 billion parameters are active at any given time. That selective activation is why it generates text quickly on hardware that would struggle with a traditional 26-billion-parameter dense model at comparable speeds.

The problem is tool calling. Tool calling is the model's ability to decide when to use external functions or APIs during a conversation - for example, triggering a web search or running a calculation. In multiple user setups, Gemma 4 26B A3B's tool-calling pipeline enters an infinite loop that requires a manual reset. The bug appears to be configuration-dependent; some users have resolved it by adjusting the model's system prompt format and restructuring how tool-call instructions are passed.

Qwen3.5 MoE - a competing open-source model at a similar parameter count - handles tool calling more cleanly on the same hardware, though it carries separate Windows 11 compatibility issues that have frustrated users on that platform.

For high-speed text generation on a single consumer GPU, Gemma 4 26B A3B is worth testing. For workflows that depend on tool integration, it currently needs more configuration patience than the alternatives require.