Apple M5 Max Ships to Early Buyers, Local LLM Benchmarks on the Way

March 11, 2026

The first M5 Max machines are landing on doorsteps, and every local AI enthusiast is asking the same question: how many parameters can this thing actually run?

Apple's latest top-end silicon matters to AI practitioners for one reason: unified memory. A consumer discrete GPU limits you to its onboard VRAM, typically 24GB to 32GB on current top-end cards, whereas Apple's architecture lets the CPU and GPU share a single memory pool. The M4 Max topped out at 128GB, enough to run quantized 70B-parameter models (think Llama 3 70B) at usable speeds. Early spec sheets suggest the M5 Max pushes memory bandwidth further, which translates directly into faster token generation, because producing each token means reading the model's weights out of memory.
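To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes the common simplification that a dense model reads all of its weights once per generated token, so memory bandwidth divided by model size gives an upper bound on tokens/s. The function names are hypothetical, the bits-per-weight values are approximations for llama.cpp-style Q4 and Q6 formats, and 546 GB/s is Apple's published figure for the top M4 Max configuration. Real throughput lands below this ceiling, and no M5 Max figure is assumed here.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Ignores the KV cache and runtime overhead, so the real footprint is larger.
    """
    return params_billion * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on decode speed if every token requires one full pass over the weights."""
    return bandwidth_gb_s / size_gb


if __name__ == "__main__":
    M4_MAX_BANDWIDTH_GB_S = 546  # top M4 Max configuration
    # Approximate bits per weight, including quantization scales (assumed values).
    formats = {"Q4": 4.5, "Q6": 6.56}
    for name, bits in formats.items():
        size = model_size_gb(70, bits)
        bound = max_tokens_per_second(M4_MAX_BANDWIDTH_GB_S, size)
        print(f"70B at {name}: ~{size:.1f} GB of weights, at most ~{bound:.1f} tokens/s")
```

Swapping in a higher bandwidth figure shows how the ceiling scales linearly, which is why the bandwidth number is the one to watch on any new spec sheet.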

For anyone running local LLMs through tools like Ollama or llama.cpp, the real test is tokens per second at various model sizes. The M4 Max could manage roughly 10-15 tokens/s on a 70B model depending on quantization. Community benchmarks for the M5 Max should start appearing within days now that hardware is shipping.
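For readers who want to collect their own numbers, the following is a minimal sketch of measuring generation speed through Ollama's local HTTP API. It relies on the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that Ollama returns with a completed `/api/generate` response. The helper names are hypothetical, and the model tag in the usage comment is a placeholder for whatever model you have pulled locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def decode_tokens_per_second(response: dict) -> float:
    """Generation speed from the metrics in a completed Ollama response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str) -> dict:
    """Send one non-streaming generate request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example usage (requires a running Ollama server; the model tag is a placeholder):
#   result = run_prompt("llama3:70b", "Explain unified memory in two sentences.")
#   print(f"{decode_tokens_per_second(result):.1f} tokens/s")
```

Averaging over several prompts of different lengths gives a steadier figure than a single run, since the first request also pays the cost of loading the model.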

The practical appeal is straightforward: run capable AI models on your desk without sending data to anyone's cloud. That matters for lawyers handling privileged documents, developers working with proprietary code, and anyone who just doesn't want their prompts on someone else's server.

No benchmark results are available yet, so hold off on any purchase decisions. The numbers that matter most are tokens/s at Q4 and Q6 quantization for 70B and 100B+ models, and whether the memory bandwidth bump makes the jump from the M4 Max worth the price premium. We'll cover the results once real-world testing data comes in.