What Happened
On February 26, Hugging Face published a detailed technical post on Mixture of Experts (MoE) architecture support in its transformers library. The post covers both the conceptual underpinnings of MoE models and specific engineering improvements developed to handle recent releases including DeepSeek R1, Qwen 3.5, GLM-5, and Mixtral variants.
The most significant infrastructure change is a weight loading refactor that reduced load time for large MoE models by 3.2x. Qwen1.5-110B-Chat now loads in 20.7 seconds versus 66.2 seconds in the previous library version. The refactor introduced a WeightConverter system that handles the mismatch between checkpoint storage format (256+ separate expert tensors) and the packed tensor format required for efficient runtime execution.
Three inference backends are now selectable: an eager loop mode for debugging, a batched matrix multiply backend using torch.bmm suited for small batches on GPU, and a grouped matrix multiply backend using torch._grouped_mm optimized for large batches and memory-constrained settings. Expert parallelism across multiple GPUs was also added via a new GroupedGemmParallel module that distributes expert weights across devices and uses all-reduce for aggregation.
In collaboration with Unsloth, the library now achieves roughly 12x faster MoE training throughput while reducing VRAM usage by approximately 35%, enabling longer context windows and larger effective batch sizes on consumer hardware.
Why It Matters
MoE has become the dominant architecture for frontier models. DeepSeek R1's January 2025 release normalized the approach, and subsequent releases including Qwen 3.5, MiniMax M2, GLM-5, and Kimi K2 all use MoE variants. The key advantage is that MoE models activate only a subset of their parameters per token - GPT-OSS-20B activates roughly 3.6 billion parameters per inference step despite having 21 billion total parameters - enabling faster generation at lower compute cost.

If the transformers library handles these architectures inefficiently, it creates real friction for researchers who need to fine-tune or deploy them. Startup latency matters in production settings where model serving is billed by time, and a 3.2x load improvement translates directly to reduced cold-start costs for serverless deployment configurations.
The three-backend approach for expert computation reflects a real engineering tradeoff: no single execution strategy works optimally across all hardware configurations, batch sizes, and model sizes. Giving practitioners a dial rather than a fixed default reduces edge cases and makes the library more useful across heterogeneous deployment environments.
Our Take
This is infrastructure work that doesn't generate headlines but compounds meaningfully for the people who use these models professionally. The Unsloth collaboration's 12x training speedup figure has a documented track record behind it - that team has consistently delivered real efficiency improvements rather than benchmark-cherry-picked numbers. The grouped matrix multiply backend for large-batch scenarios is particularly relevant for teams running inference at scale, where the throughput difference between backends is substantial.
The WeightConverter design is worth understanding if you work with MoE models: the checkpoint-to-runtime format mismatch is an underappreciated source of loading latency, and the explicit converter pipeline makes debugging format issues much easier than the previous implicit handling.
