
A 397B Parameter Model Now Runs on a MacBook With 48GB RAM


A 397-billion-parameter AI model should not run on a laptop with 48GB of RAM. Models that size typically need server hardware with hundreds of gigabytes of memory. But developer Dan Woods built a system called flash-moe that pulls it off, running Qwen3.5 397B on an M3 MacBook Pro at 5.7 tokens per second.

That is slow compared to cloud APIs, but fast enough to hold a conversation, and the fact that it works at all on consumer hardware is the real story.

How It Fits in 48GB

Qwen3.5 397B is a Mixture of Experts (MoE) model. In a standard "dense" model, every parameter is used for every token. An MoE model instead splits its knowledge across dozens of specialized sub-networks called experts and activates only a handful of them for each token it generates. Qwen3.5 397B has 397 billion total parameters but uses only a fraction of them at any given moment.
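
To make the "handful of experts per token" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert count, the value of k, and the random router scores are made-up illustration values, not Qwen3.5's actual configuration, and the function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize so the weights of the chosen experts sum to 1.
    return [(i, probs[i] / norm) for i in chosen]

NUM_EXPERTS = 64  # hypothetical; the real model's expert count is not given here
TOP_K = 4         # hypothetical number of experts activated per token

rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output

selected = route_token(scores, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS
print(f"experts used for this token: {[i for i, _ in selected]}")
print(f"fraction of experts active: {active_fraction:.2%}")
```

With these made-up numbers, only 4 of 64 experts run for the token, so most expert weights sit idle at any given step. That idle majority is what makes it plausible to leave them on disk.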

Flash-moe exploits this architecture by combining two ideas. The first comes from Apple's "LLM in a Flash" research paper, which showed how to keep model weights on flash storage (your SSD) and load only the pieces needed right now into RAM. The second is a predictive layer that guesses which experts will be needed next and pre-loads them before they are called, cutting down on wait time.
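
The two ideas can be sketched together as a small cache. This is an illustration under stated assumptions, not flash-moe's actual implementation: `ExpertCache`, `fake_load`, the least-recently-used eviction policy, and the hard-coded `guesses` list are all hypothetical stand-ins. A real system would read weights from the SSD and overlap prefetching with computation.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache standing in for RAM; load_fn stands in for a read from SSD."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity       # how many experts fit in RAM at once
        self.load_fn = load_fn
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.hits = 0                  # expert was already in RAM when needed
        self.stalls = 0                # expert had to be loaded on demand
        self.prefetched = 0            # experts loaded ahead of time

    def _insert(self, expert_id):
        self.resident[expert_id] = self.load_fn(expert_id)
        while len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used

    def prefetch(self, expert_ids):
        """Load predicted experts before they are requested."""
        for expert_id in expert_ids:
            if expert_id in self.resident:
                self.resident.move_to_end(expert_id)
            else:
                self._insert(expert_id)
                self.prefetched += 1

    def get(self, expert_id):
        """Fetch an expert for computation, loading it on demand if missing."""
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)
        else:
            self.stalls += 1
            self._insert(expert_id)
        return self.resident[expert_id]

def fake_load(expert_id):
    # Hypothetical stand-in for reading one expert's weights from disk.
    return f"weights-for-expert-{expert_id}"

trace = [[0, 1], [2, 3], [4, 5], [0, 1]]    # experts each token needs (made up)
guesses = [[0, 1], [2, 3], [4, 6], [0, 1]]  # hypothetical predictor, one wrong guess

cache = ExpertCache(capacity=4, load_fn=fake_load)
for needed, guess in zip(trace, guesses):
    cache.prefetch(guess)
    for expert_id in needed:
        cache.get(expert_id)

print(f"hits={cache.hits} stalls={cache.stalls} prefetched={cache.prefetched}")
```

In this toy trace the only stall comes from the one wrong guess (expert 6 predicted, expert 5 needed). Every correctly predicted expert is already in RAM by the time it is requested, which is the wait-time saving the predictive layer is after.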

Woods used Andrej Karpathy's autoresearch tool to iteratively develop and refine the approach. The result is an open-source harness that manages memory like a traffic controller, shuttling expert weights in and out of RAM fast enough to maintain usable generation speed.

What 5.7 Tokens Per Second Actually Means

For context, ChatGPT and Claude typically stream responses at 30-80 tokens per second. At 5.7 tokens per second, you are looking at roughly 4-5 words appearing each second. That is readable but noticeably slower than cloud services. Woods says the math suggests 18 tokens per second is achievable on his hardware with further optimization, which would put it in comfortable conversational range.
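
The words-per-second figure can be checked with a few lines of arithmetic. The conversion factor of 0.75 words per token is an assumption (a common rule of thumb for English text), and the 300-word reply length is an arbitrary example.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough rule of thumb for English text

def words_per_second(tokens_per_second, words_per_token=WORDS_PER_TOKEN):
    """Convert a generation rate in tokens/s to an approximate words/s."""
    return tokens_per_second * words_per_token

def seconds_for_reply(word_count, tokens_per_second, words_per_token=WORDS_PER_TOKEN):
    """Approximate time to stream a reply of word_count words."""
    return word_count / words_per_second(tokens_per_second, words_per_token)

current = words_per_second(5.7)     # rate reported for flash-moe today
projected = words_per_second(18.0)  # rate Woods says is achievable

print(f"5.7 tok/s  is about {current:.1f} words/s")
print(f"18 tok/s   is about {projected:.1f} words/s")
print(f"300-word reply at 5.7 tok/s: about {seconds_for_reply(300, 5.7):.0f} s")
print(f"300-word reply at 18 tok/s:  about {seconds_for_reply(300, 18.0):.0f} s")
```

Under that assumption, a 300-word answer takes a bit over a minute at today's rate and a bit over 20 seconds at the projected rate.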

The tradeoff is obvious: you get full privacy, zero API costs, and access to a frontier-class model, but you give up speed and need a relatively high-end MacBook. The 48GB M3 MacBook Pro starts around $2,400, so this is not exactly budget hardware, but it is a fraction of what GPU server rental costs over time.

Local AI Keeps Getting More Practical

Six months ago, running anything above a 70B model locally required serious GPU hardware or aggressive quantization that visibly hurt quality. Flash-moe represents a different path: instead of shrinking the model, you get smarter about which parts of a large model need to be in memory at any moment.

The flash-moe code is available on GitHub for anyone to try. If you have an Apple Silicon Mac with 48GB or more of RAM, this is the largest model you can run locally today at usable speeds. For developers and researchers who need to keep data off cloud servers, that is a meaningful option that did not exist a few months ago.