$600 to $700 billion. That's how much the AI industry plans to spend on cloud infrastructure in 2026 alone, with a cumulative $3 trillion projected by 2028. A detailed new analysis argues much of that spending may be aimed at the wrong target.
The thesis: five converging technologies are pushing AI inference - the process of actually running a trained model to get answers - off expensive cloud servers and onto cheap, local hardware. If the trajectory holds, the hardware needed to run a frontier-class AI model could cost less than a game console within five to seven years.
The Five Technologies Making This Possible
Three are already shipping in production systems:
Mixture-of-Experts (MoE): Instead of activating every parameter for every token, MoE models route each token to a small set of specialized sub-networks, known as experts. DeepSeek-V3, for example, has 671 billion parameters but activates only 37 billion of them per token. You get big-model quality at close to small-model compute cost.
Multi-head Latent Attention (MLA): A compression technique that shrinks the key-value cache, the memory a model uses to hold the text it has already processed, by 93-98% compared to standard attention designs. Less memory means cheaper hardware.
Post-training quantization: Reduces the precision of model parameters from 16-bit numbers to 4-bit or lower. The result: 98-99% of the original quality at a quarter of the memory cost. This is why quantized versions of Llama 70B now run on $2,000 consumer GPUs, down from the $55,000 servers required in 2024.
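The arithmetic behind these three techniques is easy to check. The sketch below is a back-of-the-envelope calculation, not a benchmark. It uses only the figures quoted above (671 billion total and 37 billion active parameters, 16-bit versus 4-bit weights, a 93-98% smaller key-value cache). The function names and the 10 GB baseline cache are hypothetical, chosen purely for illustration, and the weight-memory estimate ignores activations and runtime overhead.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of an MoE model's parameters used for a single token."""
    return active_params / total_params


def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Memory needed just to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def compressed_cache_gb(baseline_gb: float, reduction: float) -> float:
    """Key-value cache size after a fractional reduction (0.93 means 93% smaller)."""
    return baseline_gb * (1.0 - reduction)


# MoE: DeepSeek-V3 figures quoted in the article.
print(f"Active share per token: {active_fraction(671e9, 37e9):.1%}")

# Quantization: a 70-billion-parameter model at 16-bit versus 4-bit weights.
print(f"70B at 16-bit: {weight_memory_gb(70e9, 16):.0f} GB")
print(f"70B at 4-bit:  {weight_memory_gb(70e9, 4):.0f} GB")

# MLA: assumed 10 GB baseline cache (hypothetical), reduced by 93% and by 98%.
for reduction in (0.93, 0.98):
    print(f"Cache after {reduction:.0%} cut: {compressed_cache_gb(10.0, reduction):.2f} GB")
```

On these numbers, only about 5.5% of DeepSeek-V3's parameters do work for any given token, and going from 16-bit to 4-bit weights cuts a 70-billion-parameter model from 140 GB to 35 GB of weight storage.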
Two more are promising but unproven at scale:
BitNet b1.58 (ternary training): Instead of storing each parameter as a 16-bit floating-point number, this approach trains the model with just three possible values per parameter: -1, 0, and +1. Three values carry about 1.58 bits of information, hence the name. Microsoft demonstrated a 71% energy reduction at model sizes of 3-8 billion parameters, but nobody has published results at frontier scale (70 billion parameters and above).
Matmul-free transformers: Replace the computationally expensive matrix multiplication at the heart of AI models with simpler operations. Early tests show 61% lower energy consumption, but again, only at small scales.
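To make the ternary idea concrete, here is a minimal sketch in plain Python. It assumes the absmean scheme described for BitNet b1.58: divide the weights by their mean absolute value, then round and clip to -1, 0, or +1. The function names are hypothetical. Real systems apply this during training and pack the weights into low-bit storage, neither of which is shown here. The second function illustrates why ternary weights reduce the need for multiplication: a dot product collapses into additions and subtractions plus one final scaling step. It shows the arithmetic only, not the matmul-free transformer architecture itself.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared scale (absmean scheme, assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def ternary_dot(ternary, scale, inputs):
    """Dot product with ternary weights: only adds and subtracts inside the loop."""
    acc = 0.0
    for t, x in zip(ternary, inputs):
        if t == 1:
            acc += x      # weight +1: add the input
        elif t == -1:
            acc -= x      # weight -1: subtract the input
        # weight 0: skip the input entirely
    return acc * scale    # a single multiplication restores the magnitude


weights = [0.9, -0.05, -1.2, 0.4]
inputs = [1.0, 2.0, 3.0, 4.0]

ternary, scale = ternarize(weights)
print("ternary weights:", ternary)
print("approximate dot product:", ternary_dot(ternary, scale, inputs))
print("full-precision dot product:", sum(w * x for w, x in zip(weights, inputs)))
```

The approximate and full-precision results differ, which is the quality cost that training with ternary weights from the start is meant to absorb.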
The $60 Device
Here's where it gets concrete. Ternary inference doesn't need cutting-edge chip fabrication. The chips can be made on 28-nanometer manufacturing lines - technology that's been around for over a decade and is already operating at scale in China. The estimated bill of materials for a local inference device: a RISC-V processor with ternary accelerator ($8-20), 64GB of commodity memory ($30-60), plus miscellaneous components ($20-35). Total: roughly $60 to $115 (the low-end figures sum to $58).
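The bill of materials can be checked by summing the low and high ends of each range. The component prices below come straight from the estimate above. The dictionary keys and the function name are illustrative only.

```python
# Component cost ranges in US dollars (low, high), as quoted in the article.
BOM_USD = {
    "risc_v_with_ternary_accelerator": (8, 20),
    "memory_64gb": (30, 60),
    "miscellaneous": (20, 35),
}


def bom_total(bom):
    """Return the (low, high) total for a bill of materials."""
    low = sum(lo for lo, _ in bom.values())
    high = sum(hi for _, hi in bom.values())
    return low, high


low, high = bom_total(BOM_USD)
print(f"Estimated bill of materials: ${low} to ${high}")
```

The low end comes to $58, which the headline figure rounds to $60. The high end matches the quoted $115 exactly.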
That price point changes the economics fundamentally. AI stops being a monthly subscription you rent from a cloud provider and becomes a one-time hardware purchase, like a calculator or a laptop.
The Infrastructure Bet That Might Go Wrong
The geopolitical angle is worth noting. US export controls target advanced chip fabrication, but ternary inference doesn't need advanced fabrication. The restrictions on training hardware may actually be accelerating efficiency research that makes inference cheaper, creating the opposite of the intended effect.
More practically, the $3 trillion infrastructure buildout assumes centralized inference demand keeps growing. If local devices start handling routine tasks - writing, coding, summarization - that demand shifts away from data centers instead. The analysis draws a direct parallel to the fiber optic buildout of the late 1990s, where massive infrastructure investment ended in a bust in 2000-2002 when projected demand failed to materialize.
The timeline laid out: quantized models running locally at 80-95% of cloud quality by 2026-2028 for everyday tasks, with a fork point around 2028-2030 depending on whether ternary training scales to frontier models. If it does, the shift to local devices accelerates fast. If it plateaus, the transition stretches to a decade or more.
For anyone paying for AI subscriptions today, the practical takeaway is straightforward. The per-token costs you're paying now are likely to keep dropping. Local options that don't phone home to anyone's cloud are getting better every quarter. The question isn't whether AI inference goes local - it's how much of it and how fast.