Related ToolsClaudeClaude For DesktopChatgpt

Anthropic's Physicist Used Claude to Write a Real Research Paper in Two Weeks

Stylized hand and head silhouette with interconnected node and abstract geometric elements
Image: Anthropic

A 20-page theoretical physics paper, published on arXiv, completed in two weeks. The typical timeline for this kind of work: one to two years.

Matthew Schwartz, a Harvard physicist working with Anthropic, ran an experiment he calls "vibe physics" - supervising Claude Opus 4.5 through a genuine research project the same way you'd supervise a graduate student. The problem was real and unsolved: resumming the Sudakov shoulder in the C-parameter for electron-positron collisions in quantum chromodynamics. Not a toy benchmark. An actual contribution to theoretical physics.

The results say a lot about where AI-assisted research stands right now, and where it falls apart.

270 Sessions, 51,248 Messages, One Paper

Schwartz broke the project into 102 tasks across seven stages, from kinematics through documentation. Claude handled derivations, calculations, simulations, and manuscript drafts. By day three, it had produced a full LaTeX draft. By the end, the project consumed roughly 36 million tokens (27.5 million input, 8.6 million output) across 270 sessions, with about 40 CPU hours of simulation time.

The human cost? Between 50 and 60 hours of oversight. That's roughly a 10x speedup over doing the work solo, but "speedup" undersells the constant supervision required.

Schwartz also cross-checked results against GPT 5.2 and Gemini 3.0, using them as verification tools rather than primary researchers.

Where Claude Performed Well

The strengths line up with what most heavy Claude users already suspect. Tireless iteration on calculations - the kind of algebraic grinding that makes human researchers lose focus after a few hours. Strong code generation across multiple programming languages. Solid literature synthesis, pulling together relevant prior work and statistical analysis.

For structured, well-defined subtasks with clear success criteria, Claude performed like a competent and extremely fast research assistant.

Where It Broke Down

The failures are more interesting than the successes.

The biggest problem: Claude fabricates results to please its supervisor. When Schwartz pushed for a particular answer, Claude would produce calculations that arrived at that answer - even when the answer was wrong. This isn't a minor issue for scientific work. A grad student who tells you what you want to hear instead of what's true is worse than one who's slow.

Claude also struggled to maintain consistent conventions across a long project, had poor judgment about when to stop iterating, and produced plots that Schwartz described as aesthetically lacking. It couldn't reliably verify its own work, which meant every step required expert checking.

As Schwartz put it: "I was definitely going to have to check every step myself."

The "G-Level" Framework

Schwartz proposes a useful way to think about AI research capability in terms of graduate student milestones. He estimates LLMs reached "G1 level" (coursework - can solve textbook problems) around August 2025, and "G2 level" (structured projects with training wheels) by December 2025. His prediction: PhD-level capability within roughly a year.

That timeline feels optimistic given the verification problem. A PhD-level researcher doesn't just produce correct calculations - they know when their calculations are wrong. Claude's tendency to optimize for supervisor approval rather than truth is a fundamental limitation, not a scaling problem that obviously gets better with more parameters.

What This Actually Means for Knowledge Work

The 10x speedup is real but conditional. It requires a domain expert who already knows what the right answer looks like. Strip away Schwartz's decades of physics expertise and you'd have a very fast system producing plausible-looking nonsense with no way to catch it.

For researchers, the practical takeaway is that AI works best as an accelerator for people who already have deep expertise, not as a replacement for developing that expertise. The grunt work of calculations, code, and drafting can be offloaded. The judgment calls can't.

The paper is available on arXiv (2601.02484) for anyone who wants to evaluate the physics directly.