What happens when you train a language model on nothing but books, newspapers, and journals from before 1931?
Alec Radford - the researcher behind the original GPT model, the CLIP vision system, and Whisper speech recognition - just answered that question with Talkie, a 13-billion-parameter model built alongside Nick Levine and David Duvenaud. A language model's parameters are the numerical weights learned during training; 13 billion puts Talkie in the mid-size range, the same scale as the 13B variants of the early Llama releases. What makes it unusual isn't the size but the data: every piece of text it learned from predates January 1, 1931. No internet. No Wikipedia. No knowledge of World War II, penicillin as a medical treatment, or the transistor.
A Controlled Experiment in What LLMs Actually Know
Every major language model today was trained on overlapping datasets drawn largely from the modern web. GPT-4, Claude, Gemini, Llama - they are all understood to have absorbed much the same kinds of material: Reddit threads, Wikipedia articles, Stack Overflow answers, news sites, and digitized books. That shared origin creates statistical similarities that are hard to measure but real: the models know roughly the same things, because they absorbed the same cultural baseline.
Talkie breaks that pattern. The training corpus has a hard cutoff - nothing after December 31, 1930 was allowed in. The result is a model that can't have pattern-matched on modern answers to modern questions, because neither the questions nor the answers existed in its data. For researchers, that makes Talkie valuable as a baseline: if you want to separate what language models actually learn from what they have memorized from common internet text, a model with clean, verifiable training boundaries is useful.
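As a rough illustration of what a hard date cutoff means in practice, here is a minimal Python sketch. It is not the Talkie team's pipeline, which the article does not describe; the `Document` record, the `filter_by_cutoff` helper, and the choice to drop undated documents are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator, Optional

# Assumption: the first disallowed date, matching "predates January 1, 1931".
CUTOFF = date(1931, 1, 1)


@dataclass
class Document:
    """Hypothetical corpus record; the field names are illustrative only."""
    title: str
    published: Optional[date]  # None when the publication date is unknown
    text: str


def filter_by_cutoff(docs: Iterable[Document],
                     cutoff: date = CUTOFF) -> Iterator[Document]:
    """Yield only documents with a known date strictly before the cutoff."""
    for doc in docs:
        # Undated documents are dropped (an assumption of this sketch):
        # a hard cutoff cannot be verified without a date.
        if doc.published is not None and doc.published < cutoff:
            yield doc


corpus = [
    Document("Newspaper, late 1930", date(1930, 12, 31), "..."),
    Document("Journal, early 1931", date(1931, 1, 1), "..."),
    Document("Undated pamphlet", None, "..."),
]

kept = [doc.title for doc in filter_by_cutoff(corpus)]
print(kept)
```

Only the December 31, 1930 item survives. Dropping undated material is the conservative choice here: it shrinks the corpus, but it keeps the boundary verifiable, which is the property that makes such a model useful as a baseline.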
To evaluate Talkie's outputs, the team used Claude Sonnet as a judge - checking whether responses stayed within the expected scope of pre-1931 knowledge and language. Using a modern frontier model to evaluate a deliberately historical one is a practical necessity right now: established automated tools for testing temporal accuracy in language models don't really exist yet.
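The judge setup can be pictured as a small harness that scores each response as in scope or out of scope. The Python sketch below is an assumption-laden illustration, not the team's evaluation code: the `Judge` type, `in_scope_rate`, and the crude `keyword_judge` stand-in are hypothetical, and a real setup would replace the stand-in with a call to a judge model.

```python
from typing import Callable, List, Tuple

# A judge takes (prompt, response) and returns True if the response
# stays within the expected pre-1931 scope.
Judge = Callable[[str, str], bool]

# Hypothetical stand-in for a model-based judge: it only flags a few
# obviously post-1930 terms and is far cruder than an LLM judge.
ANACHRONISMS = ("world war ii", "transistor", "internet", "wikipedia")


def keyword_judge(prompt: str, response: str) -> bool:
    """Return False if the response mentions a known anachronistic term."""
    lowered = response.lower()
    return not any(term in lowered for term in ANACHRONISMS)


def in_scope_rate(pairs: List[Tuple[str, str]], judge: Judge) -> float:
    """Fraction of (prompt, response) pairs the judge accepts as in scope."""
    if not pairs:
        raise ValueError("no responses to evaluate")
    passed = sum(1 for prompt, response in pairs if judge(prompt, response))
    return passed / len(pairs)


question = "What is the fastest way to send a message overseas?"
samples = [
    (question, "A cablegram sent by submarine telegraph."),
    (question, "Send it over the internet."),
]

rate = in_scope_rate(samples, keyword_judge)
print(rate)
```

Passing the judge in as a function keeps the scoring loop independent of whichever model does the judging, so the keyword stand-in can be swapped out without touching the rest of the harness.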
Beyond the Novelty
For most people building AI applications, Talkie isn't something you'll deploy. But the questions it raises matter for the field. As AI gets embedded in legal research, archival analysis, and historical document work, understanding how training data shapes what a model "knows" becomes practical rather than theoretical.
There's also a simpler point: when every major model shares the same training ancestry, isolating cause and effect is hard. Talkie gives researchers something genuinely different - a language model whose knowledge has clean edges. Radford's previous work has had a lasting influence on the field, and the underlying question here is a real one: what does training data diversity actually do to model behavior?