Luce Spark Runs a 35B AI Model on a 16GB GPU Without the Speed Penalty

June 8, 2026

35 billion parameters on a single consumer GPU with 16GB of VRAM, running at usable speeds. That's the specification behind Luce Spark, a new open-source model built on a Mixture of Experts architecture.

MoE (Mixture of Experts) works differently from standard dense models. Instead of activating the entire network for every token, a router sends each token through a small subset of specialized "expert" sub-networks. Only a fraction of the total parameters run at any given moment, which cuts the compute and memory traffic needed per token and is what makes a 35B MoE practical on hardware that would struggle with a conventional 35B dense model.
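To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is a toy, not Luce Spark's implementation: the expert count, the `TOP_K` value, the random router scores, and the function names `route_top_k` and `moe_forward` are all assumptions chosen for illustration. In a real model the router scores come from a learned layer applied to each token.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the outputs of the k selected experts only.

    Experts that are not selected are never called, which is where
    the compute saving comes from.
    """
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_scores, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Hypothetical toy setup: 8 experts, each scaling its input by a different factor.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Stand-in for a learned router: random scores with a fixed seed.
rng = random.Random(0)
scores = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]

selected = route_top_k(scores, TOP_K)
output = moe_forward([1.0, 2.0], experts, scores, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts run for this token

print("selected experts and weights:", selected)
print("output:", output)
print("fraction of experts active:", active_fraction)
```

With 2 of 8 experts selected, only a quarter of the expert parameters do any work for this token, even though all eight experts still have to be stored somewhere.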

The more notable claim is that it runs without the "offload tax." Offloading happens when a model's weights exceed what fits in GPU memory: the system constantly shuffles layers between the GPU and much slower CPU RAM during generation. That movement typically drops output speed from a reasonable 10-15 tokens per second (roughly 7-11 words per second of English text) down to 2-3, making the model frustrating to use for anything beyond short queries. Luce Spark's architecture is designed to avoid that bottleneck on 16GB cards.
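A back-of-envelope calculation shows why offloading comes up at this size. The sketch below estimates the memory taken by the weights alone at several precisions, and converts tokens per second into words per second. Luce Spark's quantization and active-parameter count are not given here, so the bit widths are generic examples. The 0.75 words-per-token ratio is a common rule of thumb for English, and the helper names are hypothetical. The estimate ignores the KV cache and activations, which need additional memory.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8

def words_per_second(tokens_per_second, words_per_token=0.75):
    """Convert token throughput to words, assuming ~0.75 English words per token."""
    return tokens_per_second * words_per_token

VRAM_GB = 16  # the card size discussed in the article

# Generic precisions for illustration; not a statement about Luce Spark itself.
for bits in (16, 8, 4, 3):
    size = weight_gb(35, bits)
    verdict = "fits" if size <= VRAM_GB else "exceeds 16 GB"
    print(f"35B parameters at {bits}-bit: {size:.1f} GB ({verdict})")

for tps in (2, 3, 10, 15):
    print(f"{tps} tokens/s is about {words_per_second(tps):.1f} words/s")
```

Anything in the "exceeds" rows has to be handled some other way, and layer-by-layer offloading to CPU RAM is the conventional fallback that costs so much speed.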

For people running local AI setups, whether for privacy, offline workflows, or avoiding per-token API costs, a usably fast 35B-class model on a mid-range consumer card is a step forward from what was accessible a year ago. In practice, 16GB of VRAM has been comfortable territory for 7B or 13B dense models. Getting 35B-class capability without expensive hardware or a painful speed penalty is the meaningful part of this release.

The model is available for download and testing.
