35 billion parameters on a single consumer GPU with 16GB of VRAM, running at usable speeds. That's the specification behind Luce Spark, a new open-source model built on a Mixture of Experts architecture.
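A Mixture of Experts (MoE) model works differently from a standard dense model. Instead of activating the entire network for every token, a router sends each token through a small subset of specialized "expert" sub-networks. Only a fraction of the total parameters run at any given moment, which cuts the compute and memory traffic needed per token. All of the experts still have to be stored somewhere, so sparse activation helps mostly with speed; it is part of how a 35B MoE stays usable on hardware that would struggle with a conventional 35B dense model.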
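To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It illustrates the general MoE technique, not Luce Spark's actual router: the function names, the toy experts, the router scores, and the choice of k=2 are all assumptions made for the example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, router_scores, experts, k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    chosen = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router scores for one token; experts 1 and 3 score highest.
output, active = moe_layer(10.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(active, output)
```

Only two of the four experts are evaluated for this token. The other two cost no compute, although their weights would still occupy memory in a real model.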
The more notable claim is that it runs without the "offload tax." Offloading happens when a model's weights exceed what fits in GPU memory: the system constantly shuffles layers between the GPU and much slower CPU RAM during generation. That movement typically drops output speed from a reasonable 10-15 tokens per second (roughly 8-12 words per second) down to 2-3 tokens per second, making the model frustrating to use for anything beyond short queries. Luce Spark's architecture is designed to avoid that bottleneck on 16GB cards.
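A rough back-of-the-envelope check shows when offloading becomes necessary. The sketch below estimates weight storage only, as parameters times bits per weight; the bit widths are illustrative assumptions, not Luce Spark's published quantization, and the estimate ignores the KV cache, activations, and runtime overhead, all of which need additional VRAM.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def needs_offload(num_params, bits_per_weight, vram_gb):
    """True if the weights alone would not fit in the given VRAM."""
    return weight_footprint_gb(num_params, bits_per_weight) > vram_gb

PARAMS = 35e9   # 35 billion parameters, from the article
VRAM_GB = 16    # 16GB consumer card, from the article

# Assumed precisions, chosen only for illustration.
for bits in (16, 8, 4, 3):
    size = weight_footprint_gb(PARAMS, bits)
    offload = needs_offload(PARAMS, bits, VRAM_GB)
    print(f"{bits:>2}-bit: {size:6.1f} GB, offload needed: {offload}")
```

Under these assumptions, a 35B model at 16-bit precision needs about 70 GB for weights alone and even 4-bit storage comes to about 17.5 GB, so fitting everything inside 16GB takes aggressive compression or a scheme that keeps only part of the model on the GPU at a time.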
For people running local AI setups, whether for privacy, offline workflows, or avoiding per-token API costs, a fast 35B-class model on a mid-range consumer card is a step forward from what was accessible a year ago. In practice, 16GB of VRAM has been enough for 7B or 13B dense models. Getting 35B-class capability without expensive hardware or a painful speed penalty is the meaningful part of this release.
The model is available for download and testing.