
Someone Got a 400-Billion Parameter AI Model Running on an iPhone 17 Pro

March 23, 2026

Most AI models running on phones top out around 3 to 9 billion parameters. Apple Intelligence uses a roughly 3-billion-parameter model. So when the open-source project ANEMLL demonstrated Alibaba's Qwen 3.5 model - all 397 billion parameters of it - running locally on an iPhone 17 Pro with no internet connection, the number alone turned heads. That is more than 40 times larger than even the biggest models that normally run on a smartphone.

The catch: it produces text at 0.6 tokens per second, or about one word every two seconds. You would not want to have a conversation with it. But as a proof of concept for where on-device AI is headed, it is genuinely interesting.

How You Fit 400B Parameters on a 12GB Phone

The trick is that you don't, at least not in memory. The Qwen 3.5-397B model uses a Mixture-of-Experts (MoE) architecture, meaning it has 512 specialized "expert" sub-networks per layer but only activates a handful of them for each token (roughly a word or word fragment) it generates. Out of 397 billion total parameters, only about 17 billion are active for any given token.
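To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is an illustration, not the Qwen or Flash-MoE code: the value of TOP_K is an assumption (the article only says "a handful"), the toy experts just scale their input, and the names route and moe_layer are hypothetical.

```python
import math
import random

NUM_EXPERTS = 512   # per-layer expert count reported in the article
TOP_K = 10          # assumption: the article only says "a handful"


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, router_logits, experts, k=TOP_K):
    """Blend the outputs of the selected experts; the others are never called."""
    out = [0.0] * len(x)
    for idx, weight in route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Toy usage: expert number s simply multiplies its input by s.
rng = random.Random(0)
experts = [lambda x, s=i: [s * v for v in x] for i in range(NUM_EXPERTS)]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)
active_fraction = len(chosen) / NUM_EXPERTS   # share of experts touched per token
```

Because only the chosen experts are ever evaluated, the weights of the remaining experts do not need to be in memory for that token, which is the property the streaming approach relies on.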

The Flash-MoE inference engine, originally built by developer Dan Woods in a 24-hour sprint using Claude Code, exploits this by streaming only the needed experts from the phone's SSD on demand. The full model takes up roughly 163GB on disk at 3-bit quantization (a compression technique that shrinks model weights at a small cost to accuracy). The resident memory footprint - the part that stays loaded in the iPhone's 12GB of RAM - is only about 5.5GB for the routing logic and shared weights.
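The storage figures above can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers quoted in the article and assumes decimal gigabytes (1 GB = 10^9 bytes); the article does not say how the gap between the raw 3-bit size and the on-disk size is made up, so the comment about overhead is a guess.

```python
TOTAL_PARAMS = 397e9     # total parameters, from the article
BITS_PER_WEIGHT = 3      # quantization level, from the article
ON_DISK_GB = 163         # reported on-disk size
RESIDENT_GB = 5.5        # reported resident memory footprint
PHONE_RAM_GB = 12        # reported iPhone RAM

# Size if every parameter took exactly 3 bits, with no metadata at all.
raw_weight_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# Whatever is left over: plausibly quantization scales or tensors kept at
# higher precision (an assumption, not something the article states).
overhead_gb = ON_DISK_GB - raw_weight_gb

# How much of the model actually stays in RAM, and what is left for
# streamed experts, the OS, and everything else.
resident_share = RESIDENT_GB / ON_DISK_GB
ram_headroom_gb = PHONE_RAM_GB - RESIDENT_GB

print(f"raw 3-bit weights: {raw_weight_gb:.1f} GB")
print(f"unexplained overhead: {overhead_gb:.1f} GB")
print(f"resident share of model: {resident_share:.1%}")
print(f"RAM headroom: {ram_headroom_gb:.1f} GB")
```

On these figures, only a few percent of the model is resident at any time; everything else lives in flash storage until an expert is requested.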

The engine is written in C, Objective-C, and Metal (Apple's GPU programming framework) with about 1,200 lines of custom GPU shader code. No Python, no heavyweight frameworks. When the model needs an expert, Flash-MoE fires off parallel read calls to the SSD, pipes the data to the GPU, and starts computing while it is already loading the next layer's experts.
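The read-while-computing pattern described above can be sketched in a few dozen lines. This is a toy model in Python, not the Flash-MoE implementation (which the article says is written in C, Objective-C, and Metal): the flat file layout, EXPERT_BYTES, and the function names are all hypothetical, compute is a stand-in for GPU work, and the list of experts per layer is fixed in advance, whereas a real model only learns each layer's experts from its router at run time. It uses os.pread, so it assumes a Unix-like system.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

EXPERT_BYTES = 4096      # hypothetical fixed size of one expert's weights
EXPERTS_PER_LAYER = 8    # toy value; the article's model has 512 per layer
NUM_LAYERS = 3


def expert_offset(layer, expert):
    """Byte offset of an expert in the hypothetical flat weight file."""
    return (layer * EXPERTS_PER_LAYER + expert) * EXPERT_BYTES


def load_experts(pool, fd, layer, expert_ids):
    """Issue one positional read per expert; the reads run in parallel."""
    return [pool.submit(os.pread, fd, EXPERT_BYTES, expert_offset(layer, e))
            for e in expert_ids]


def compute(blobs):
    """Stand-in for the GPU work: a checksum of the loaded bytes."""
    return sum(sum(b) for b in blobs)


def run(fd, plan):
    """plan[layer] lists the expert ids needed for that layer."""
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        pending = load_experts(pool, fd, 0, plan[0])
        for layer in range(len(plan)):
            blobs = [f.result() for f in pending]   # wait for this layer's weights
            if layer + 1 < len(plan):               # kick off the next layer's reads...
                pending = load_experts(pool, fd, layer + 1, plan[layer + 1])
            results.append(compute(blobs))          # ...while this layer computes
    return results


# Build a toy weight file in which every byte of slot n has the value n.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for slot in range(NUM_LAYERS * EXPERTS_PER_LAYER):
        tmp.write(bytes([slot]) * EXPERT_BYTES)
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
try:
    checksums = run(fd, plan=[[0, 3], [1, 2], [5, 7]])
finally:
    os.close(fd)
    os.remove(path)
```

The point of the ordering inside run is that the reads for layer n+1 are already in flight by the time the compute for layer n starts, so storage latency is hidden behind computation rather than added to it.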

The Bigger Picture for On-Device AI

This approach draws directly from Apple's own 2023 research paper "LLM in a Flash," which proposed storing model weights in flash storage and pulling them into memory on demand. Flash-MoE is effectively the first open-source implementation that delivers usable (if slow) speeds on consumer hardware.

The numbers scale up fast on beefier machines. On a MacBook Pro M3 Max with 48GB RAM, the same model runs at 4 to 5 tokens per second. On an M5 Max with 128GB, it hits nearly 15 tokens per second - fast enough for actual use.

For iPhone users, this is not practical today. But phone SSDs are getting faster with every generation, and MoE architectures are becoming the standard design for frontier models (DeepSeek, Qwen, and Mixtral all use them). The gap between "technically possible" and "actually useful" on mobile is closing. Two years ago, running any LLM on a phone was a novelty. Now the question is not whether phones can run large models, but how large and how fast.
