Google's Gemma 4 12B Runs Locally on Mac with Vision and Audio Support

Google has released Gemma 4 12B, a 12-billion-parameter open-source model that runs locally on Apple Silicon Macs and accepts text, images, and audio as input. It ships under the Apache 2.0 license, which permits commercial use subject to the license's attribution and notice terms.

The specs: a 256K-token context window (roughly enough to fit a 700-page book), input across three modalities, and hardware requirements that fit within a Mac with 16GB of unified memory. M2, M3, and M4 chips handle the full model without a dedicated GPU. For tighter setups, Q4 and Q5 quantized versions (compressed builds that trade a small amount of precision for a much smaller file size) reduce the memory footprint further without a significant capability hit.
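
How much quantization saves can be estimated with back-of-the-envelope arithmetic. The sketch below counts weight storage only, assuming exactly 12 billion parameters and nominal widths of 16, 8, 5, and 4 bits per weight. Real Q4 and Q5 files typically use slightly more bits per weight than the nominal figure, and the KV cache and runtime overhead add to the total. The function name is illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    PARAMS = 12e9  # assumption: exactly 12 billion parameters
    # Nominal bit widths; actual quantized formats vary around these values.
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("Q5 (nominal)", 5), ("Q4 (nominal)", 4)]:
        print(f"{label:>13}: ~{weight_footprint_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone come to about 24 GB at 16 bits, 12 GB at 8 bits, 7.5 GB at a nominal 5 bits, and 6 GB at a nominal 4 bits.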

The 5-Minute Setup via Ollama

Ollama is the fastest path to running Gemma 4 locally. It's a free tool that handles model downloads and serving in the background. Install it from ollama.com, then run ollama pull gemma4:12b in a terminal. LM Studio works too if you prefer a graphical interface.
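
Once the model is pulled, Ollama serves a local HTTP API that scripts can call. The following is a minimal sketch, assuming Ollama is running on its default address (http://localhost:11434) and that the gemma4:12b tag from the pull command is available. The helper names and the prompt are illustrative and not part of Ollama itself.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:12b"  # tag taken from the pull command in the article


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]


if __name__ == "__main__":
    try:
        print(generate("Summarize the Apache 2.0 license in two sentences."))
    except OSError as exc:
        # Covers connection refused, timeouts, and HTTP errors from urllib.
        print(f"Could not reach the local Ollama server: {exc}")
```

Setting "stream" to False returns one JSON object with the full reply in its "response" field, which keeps the example short. For interactive use, streaming is usually the better choice.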

Audio Input Sets It Apart

Most local open-source models at this parameter count stop at text and images. Gemma 4's audio input adds transcription and audio analysis without sending files to an external API, which is useful when processing internal meeting recordings, sensitive voice data, or anything you'd rather not route through a third-party service.

Performance on coding and reasoning tasks is competitive for a 12B model. It won't replace frontier models for complex multi-step work, but for local-first workflows such as reviewing private documents, batch processing without API costs, or offline use, a capable model at this weight class is a practical tool.

Google hasn't published detailed audio benchmarks yet. Community results over the next few weeks should help fill that gap.