What Happened
Donald Knuth, the legendary computer scientist behind The Art of Computer Programming, published a note on March 6 detailing how Claude Opus 4.6 solved an open combinatorics problem he had been working on for weeks.
The problem involves decomposing the arcs of a 3D Cayley digraph into three directed Hamiltonian cycles for all values of m greater than 2. Knuth had solved the case for m = 3 himself, and his colleague Filip Stappers had found empirical solutions for m = 4 through 16. But a general construction remained elusive.
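To make the problem statement concrete, here is a minimal Python sketch of a checker for candidate decompositions. It assumes the digraph has vertices (i, j, k) with coordinates taken mod m, and three arcs out of every vertex, each incrementing one coordinate by 1 mod m. That reading of "3D Cayley digraph" is an assumption, and the names is_hamiltonian_cycle, step, is_decomposition, and perm_at are illustrative, not taken from Knuth's note or Claude's program.

```python
from itertools import product

def is_hamiltonian_cycle(succ, vertices):
    """True if following succ from the first vertex visits every vertex
    exactly once and then returns to the start."""
    vertices = list(vertices)
    start = vertices[0]
    seen = set()
    v = start
    while v not in seen:
        seen.add(v)
        v = succ(v)
    # The walk must close up at the start, having covered everything.
    return v == start and len(seen) == len(vertices)

def step(v, d, m):
    """Follow the arc out of v that increments coordinate d by 1 mod m."""
    w = list(v)
    w[d] = (w[d] + 1) % m
    return tuple(w)

def is_decomposition(m, perm_at):
    """perm_at(v) returns a permutation (d0, d1, d2) of (0, 1, 2):
    cycle c leaves vertex v along direction d_c.
    Requiring a permutation at every vertex means each arc is used by
    exactly one cycle; each cycle must then be Hamiltonian."""
    vertices = list(product(range(m), repeat=3))
    if any(sorted(perm_at(v)) != [0, 1, 2] for v in vertices):
        return False
    return all(
        is_hamiltonian_cycle(lambda v, c=c: step(v, perm_at(v)[c], m), vertices)
        for c in range(3)
    )

# The naive assignment "cycle c always moves along direction c" fails:
# each cycle closes after only m steps instead of m**3.
print(is_decomposition(3, lambda v: (0, 1, 2)))
```

The naive assignment shows why the problem is hard: a valid decomposition has to vary the direction assignment from vertex to vertex in a way that keeps all three cycles Hamiltonian at once.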
Stappers posed the problem directly to Claude Opus 4.6 using Knuth's exact mathematical wording, along with instructions to document its progress after each exploration step. What followed was a methodical 31-step investigation that took about one hour.
Claude's approach was systematic. It reformulated the problem using Cayley digraph notation, tried a brute-force depth-first search (too slow), identified a "serpentine pattern" related to the modular Gray code, developed a fiber decomposition framework, ran simulated annealing (SA) experiments, then concluded it needed "pure math." At exploration 30, it noticed a structural pattern in its earlier annealing results. By exploration 31, it had a working Python program that produced valid decompositions for all odd m from 3 to 101.
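For readers unfamiliar with the modular Gray code mentioned above, the sketch below shows one standard construction: write n in base m with digits a_0, a_1, a_2 and output the digit differences (a_i - a_{i+1}) mod m. Consecutive codewords then differ by +1 mod m in exactly one coordinate, so the sequence traces a single Hamiltonian cycle in the digraph on triples mod m (assuming arcs that increment one coordinate by 1 mod m). This is not necessarily the serpentine pattern Claude found, and the function names are illustrative.

```python
def modular_gray(n, m, digits=3):
    """Modular Gray codeword for n, as a tuple of base-m digits."""
    # Base-m digits of n, least significant first, padded with a final 0.
    a = [(n // m**i) % m for i in range(digits)] + [0]
    return tuple((a[i] - a[i + 1]) % m for i in range(digits))

def gray_cycle(m, digits=3):
    """All m**digits codewords in Gray-code order."""
    return [modular_gray(n, m, digits) for n in range(m**digits)]

def is_unit_step(u, v, m):
    """True if v equals u with exactly one coordinate increased by 1 mod m."""
    diffs = [(y - x) % m for x, y in zip(u, v)]
    return sorted(diffs) == [0] * (len(u) - 1) + [1]

cycle = gray_cycle(3)
# Every vertex appears once, and every step (including the wrap-around
# from the last codeword to the first) follows a single arc.
print(len(set(cycle)) == 27)
print(all(is_unit_step(cycle[i], cycle[(i + 1) % 27], 3) for i in range(27)))
```

A single cycle like this is easy to write down. The open problem asked for three such cycles that together use every arc exactly once, which is where the general construction was needed.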
Knuth then proved the construction correct for all odd m, finding that Claude had discovered one of exactly 760 valid "Claude-like" decompositions.
The even case followed shortly after. Ho Boon Suan used GPT-5.3-codex to find a construction for even m >= 8, then GPT-5.4 Pro produced a complete 14-page proof with no human editing required. Kim Morrison formalized Knuth's odd-case proof in Lean within days. Researcher Keston Aquino-Michaels demonstrated that pairing GPT and Claude together as complementary agents produced yet another valid solution and a simpler even-case construction.
Why It Matters
This is not a toy demo or a contrived benchmark. Knuth is one of the most important computer scientists alive, and this was a real open problem from active research for a forthcoming volume of his life's work. He had spent weeks on it. Claude found a working construction for the odd case in about an hour.
What stands out is Claude's problem-solving process. It did not just pattern-match or brute-force. It reformulated the problem mathematically, tried multiple approaches, recognized dead ends ("SA can find solutions but cannot give a general construction. Need pure math"), and pivoted strategies. At one point it told itself: "don't think in fibers, think directly about what makes a Hamiltonian cycle." That is qualitatively different from what most people expect from LLMs.
The multi-model collaboration is also notable. Claude solved the odd case. GPT-5.3-codex cracked the even case. GPT-5.4 Pro wrote the even-case proof. Kim Morrison's Lean formalization machine-checked Knuth's odd-case proof. Different models and tools contributed different strengths to close out a single problem.
Our Take
Knuth does not hand out praise lightly. When he writes "I'll have to revise my opinions about generative AI" and calls this "definitely an impressive success story," that carries weight.
But the details matter as much as the result. Stappers had to restart sessions when Claude hit random errors, and he "had to remind Claude again and again" to document its progress. Claude eventually got stuck on the even case and "was not even able to write and run explore programs correctly anymore." These are real limitations that persist even at the frontier.
The most interesting implication might be Aquino-Michaels' multi-agent approach, using GPT and Claude together with complementary skills. Knuth flagged this as having "potentially significant implications for how new problems can be tackled." We are starting to see LLMs used not as single oracles but as collaborative research tools, each with different strengths, steered by humans who know when to push and when to pivot.
For anyone still dismissing LLMs as "just autocomplete," this paper is required reading. For anyone treating them as infallible, the failure modes Knuth documents are equally important.