Training a 100-billion-parameter AI model normally requires a cluster of expensive GPUs. A new research paper called MegaTrain describes a system that does it on a single GPU - by rethinking where the model actually lives during training.
Most training systems keep everything on the GPU: the model's weights (the numerical values that define how it responds), gradients (the error signals used to update those weights), and optimizer states (bookkeeping data that helps the training process converge). For a 100B+ parameter model, this alone requires hundreds of gigabytes of GPU memory. No single card holds that much. So teams typically distribute the work across dozens or hundreds of GPUs, each costing thousands of dollars.
MegaTrain stores parameters and optimizer states in CPU RAM - regular computer memory, which is cheaper and available in much larger capacities - instead of GPU memory. The GPU receives data one layer at a time, computes the necessary values for that layer, then hands the results back to CPU memory. The next layer streams in, and the process repeats.
Why Full Precision Matters
The paper specifically addresses full-precision training - using 32-bit floating point numbers rather than compressed lower-bit formats like 8-bit or 4-bit quantization (techniques that trade numerical accuracy for memory savings). Quantization is common in deployment but can introduce subtle quality degradation during training. Training large models at full precision, without that compromise, is the goal here.
At full 32-bit precision, a 100B parameter model needs roughly 400GB just for the weights. Optimizer states add another 800GB on top. Modern server CPUs can address that much RAM. A single GPU cannot.
The Bandwidth Tradeoff
The cost of this approach is communication overhead. Moving data between CPU and GPU travels through a PCIe bus, which is slower than the GPU's own internal memory. MegaTrain addresses this with prefetching - loading the next layer's data while the current layer is still computing - to keep the GPU occupied. Whether throughput holds up across full training runs is something independent researchers will need to verify.
The practical upside: if the system works as described, a researcher with a high-end workstation and enough CPU RAM could run training experiments that currently require data center hardware. Training at 100B+ parameter scale is effectively limited to organizations with significant GPU budgets. Changing that access picture is a meaningful outcome if the performance numbers hold.
The paper is available on arXiv. Papers posted there aren't peer-reviewed before publication, so treat the benchmarks as preliminary until independent replication confirms them.