Xiaomi Claims 1,000 Tokens/Sec on a 1-Trillion-Parameter Model

1,000 tokens per second. That's what Xiaomi is claiming for its new MiMo-V2.5-Pro UltraSpeed, a 1-trillion-parameter AI model running on a single standard 8-GPU server.

For context: tokens are the chunks of text an AI model processes and generates (roughly a word or part of a word each). Most hosted AI services deliver 50-150 tokens per second for large models. Hitting 1,000 on a model this size would be a genuine step forward.
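To put those rates in perspective, here is a quick back-of-the-envelope sketch. The 2,000-token response length is a made-up example, and the calculation assumes the quoted rate is the speed a single request sees, which Xiaomi's claim does not specify.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 2000

# 50 and 150 are the typical range cited above; 1000 is Xiaomi's claimed rate.
for rate in (50, 150, 1000):
    seconds = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{rate:>5} tokens/sec: {seconds:.1f} s for {RESPONSE_TOKENS} tokens")
```

Under those assumptions, a long answer that takes tens of seconds at 50 tokens per second would arrive in about two seconds at 1,000.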

The MiMo model uses a Mixture of Experts (MoE) architecture, which means it doesn't activate all 1 trillion parameters for every token it processes. Instead, a router sends each token to a small subset of specialized sub-networks, the "experts." Because only a fraction of the parameters do any work per token, large MoE models are faster to run than dense models of equivalent size, but 1,000 tokens/sec on a trillion-parameter system is still a significant claim.
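The routing idea can be illustrated with a toy sketch. This is generic top-k MoE routing, not Xiaomi's implementation: the expert count (8), the number of active experts (2), the router scores, and the function names are all hypothetical, and real experts are neural network layers rather than simple multipliers.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the chosen experts and blend their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, routes


# Toy experts (hypothetical): expert n simply multiplies its input by n + 1.
NUM_EXPERTS = 8
experts = [lambda x, scale=n + 1: x * scale for n in range(NUM_EXPERTS)]

# Made-up router scores for one token; a real router computes these from the token.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 1.0]

output, routes = moe_forward(3.0, experts, logits, k=2)
active = [i for i, _ in routes]
print(f"experts used: {active} ({len(active)} of {NUM_EXPERTS})")
print(f"output: {output:.3f}")
```

In this toy setup only two of the eight experts run for the token, so most of the model's parameters sit idle on any given step. That is the property that lets an MoE model carry a very large total parameter count while doing far less computation per token.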

What makes this notable is the hardware context. Cerebras achieves extreme inference speeds using wafer-scale chips - essentially silicon the size of a dinner plate. Groq built custom chips that use on-chip SRAM memory rather than slower external DRAM. Both approaches involve purpose-built, expensive hardware. Xiaomi says it's hitting this speed on a conventional 8-GPU server of the type data centers already run.

Xiaomi hasn't yet published a formal technical paper, and the claim hasn't been backed by independent benchmarks. Cautious interest is the appropriate stance until third-party verification arrives. The company has legitimate AI research credentials through its MiMo series, but extraordinary performance claims on standard hardware warrant scrutiny.

If verified, the practical implication for developers is clear: running very large models at high speed wouldn't require access to specialized infrastructure, just the kind of GPU servers teams are already renting.