Can an LLM Trained Only on Pre-1900 Text Rediscover Modern Physics?

April 3, 2026

What happens when you train an AI model on everything humanity knew before 1900, then ask it to figure out quantum mechanics on its own?

That's the question behind Machina Mirabilis, an experiment by researcher Michael Hla that tests something fundamental about how large language models work: can they actually reason their way to new knowledge, or are they just very sophisticated pattern matchers?

The Setup

Hla trained a 3.3 billion parameter model on roughly 22 billion tokens of pre-1900 text. That is small by today's standards: frontier models like GPT-4 are estimated, though not officially confirmed, to have over a trillion parameters. The training data came from institutional book collections, British Library archives, and American newspaper databases. Any reference to Einstein, quantum mechanics, relativity, or other post-1900 physics was aggressively filtered out.
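To make the filtering step concrete, here is a minimal sketch of a date-and-keyword filter. It is not Hla's actual pipeline: the blocklist, the `keep_document` helper, and the sample documents are all hypothetical, and the article does not say how the real filter was built.

```python
import re

# Hypothetical blocklist for illustration; the real filter terms are not given.
BLOCKED_TERMS = ["Einstein", "quantum mechanics", "relativity"]

# One case-insensitive pattern that matches any blocked term as a whole phrase.
_BLOCKED = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in BLOCKED_TERMS) + r")\b",
    re.IGNORECASE,
)


def keep_document(text: str, year: int, cutoff: int = 1900) -> bool:
    """Keep a document only if it predates the cutoff and has no blocked term."""
    if year >= cutoff:
        return False
    return _BLOCKED.search(text) is None


# Invented sample documents as (text, publication year) pairs.
docs = [
    ("On the dynamical theory of the electromagnetic field.", 1865),
    ("A later reprint with a note on Einstein and relativity.", 1890),
    ("A treatise printed after the cutoff.", 1905),
]
kept = [text for text, year in docs if keep_document(text, year)]
```

A filter this blunt also discards legitimate pre-1900 uses of a word like "relativity" (Galilean relativity, for example), which is one plausible reading of "aggressively" here: it trades some lost data for a lower risk of leakage.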

The model then went through a midtraining phase on about 2,600 physics texts from scientists like Maxwell, Newton, and Faraday - all pre-1900. Finally, it was instruction-tuned (trained to follow prompts and answer questions) on about 53,000 question-answer pairs.

Then came the test: present the model with experimental observations that historically led to quantum mechanics and relativity, and see if it could make the same conceptual leaps that Planck, Einstein, and others made in the early 1900s.

Flashes of Insight, But No Real Understanding

The model faced four challenges: the ultraviolet catastrophe (classical physics predicts that a hot body should radiate ever more energy at higher frequencies, which contradicts measurement), the photoelectric effect (how light knocks electrons off metal), special relativity, and general relativity.
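For readers who want to see the ultraviolet catastrophe numerically, the sketch below compares the classical Rayleigh-Jeans law with Planck's law for blackbody radiation. These are standard textbook formulas, not material from Hla's experiment, and the temperature and frequencies are arbitrary illustrative choices.

```python
import math

H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def rayleigh_jeans(nu: float, temp: float) -> float:
    """Classical spectral radiance: B = 2 nu^2 k T / c^2."""
    return 2.0 * nu**2 * K_B * temp / C**2


def planck(nu: float, temp: float) -> float:
    """Planck spectral radiance: B = (2 h nu^3 / c^2) / (exp(h nu / k T) - 1)."""
    x = H * nu / (K_B * temp)
    return (2.0 * H * nu**3 / C**2) / math.expm1(x)


TEMP = 5000.0  # kelvin, an arbitrary illustrative temperature

# The two laws agree at low frequency and diverge sharply at high frequency.
for nu in (1e12, 1e14, 1e15, 3e15):
    ratio = rayleigh_jeans(nu, TEMP) / planck(nu, TEMP)
    print(f"nu = {nu:.0e} Hz  classical / Planck = {ratio:.3g}")
```

The classical curve keeps growing as the square of the frequency, while Planck's exponential term cuts the spectrum off. That disagreement at high frequencies is the observation the model was asked to explain.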

There were genuinely interesting moments. The model occasionally declared that "light is made up of definite quantities of energy" - which is close to the quantum hypothesis Planck introduced in 1900 for the energy of radiating oscillators and Einstein extended to light itself in 1905. It sometimes recognized that continuous wave theory couldn't explain the photoelectric effect. It even suggested that gravity and acceleration are "locally equivalent," a core insight of general relativity.

But Hla is blunt about the limitations: these flashes were inconsistent, often surrounded by incoherent reasoning, and lacked the kind of step-by-step logical chains that actual scientific discovery requires. The model could generate text that sounded like a breakthrough without demonstrating any coherent understanding behind it.

Pattern Matching vs. Reasoning

The most likely explanation, according to Hla, is that the model was doing what LLMs do best: generating plausible-sounding text based on statistical patterns. Even with post-1900 physics scrubbed from the training data, the pre-1900 texts still contain the conceptual building blocks and vocabulary that make certain phrases statistically likely in physics contexts.
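A toy example can show how a phrase that never appears in the training text can still be assembled from pieces that do. The sketch below uses a bigram model, which is vastly simpler than an LLM, and an invented four-sentence corpus. It illustrates the general idea only and does not describe how Hla's model works.

```python
from collections import defaultdict

# Invented toy corpus for illustration only; not drawn from the experiment's data.
CORPUS = [
    "light is made up of waves in the ether",
    "matter is made up of definite quantities of substance",
    "heat is a form of energy",
    "definite quantities of energy are exchanged",
]


def bigram_counts(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.split()
        for first, second in zip(words, words[1:]):
            counts[first][second] += 1
    return counts


def is_reachable(sentence, counts):
    """True if every adjacent word pair in the sentence occurred in training."""
    words = sentence.split()
    return all(counts[a][b] > 0 for a, b in zip(words, words[1:]))


counts = bigram_counts(CORPUS)
target = "light is made up of definite quantities of energy"

novel = all(target not in sentence for sentence in CORPUS)  # never seen whole
reachable = is_reachable(target, counts)                    # yet every step was seen
print(f"novel: {novel}, reachable: {reachable}")
```

The target sentence is absent from the corpus, yet each word-to-word step in it was observed, so a purely statistical generator can produce it without any notion of what a quantum is.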

Hla notes that frontier-scale models (hundreds of billions or trillions of parameters) trained on the same restricted data might perform differently. But for a 3.3 billion parameter model, the experiment suggests that producing text that resembles scientific insight is very different from actually having it.

For anyone using AI tools daily, this is a useful gut check. LLMs are powerful for summarizing, drafting, and pattern recognition. But when you need genuine novel reasoning - connecting dots that weren't connected in the training data - current models still have a fundamental gap between sounding right and being right.
