
AI Coding Test: Claude Scores +854 While ChatGPT Hits -74,383

March 20, 2026


Four frontier AI models walked into a coding challenge. Only one read the instructions.

The test was straightforward: each model wrote a Python bot for Robot Word Racer, a competitive game where bots search 15x15 letter grids for valid words within 10 seconds. Words must be at least three letters long and traced through adjacent tiles. The scoring formula was spelled out explicitly: points equal the number of letters minus six. That means a three-letter word costs you three points. A seven-letter word earns one.
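The formula is simple enough to check by hand, but a minimal sketch makes the breakeven visible. The function names `word_score` and `is_profitable` are illustrative and do not come from any of the bots.

```python
def word_score(word: str) -> int:
    """Score a word under the stated formula: number of letters minus six."""
    return len(word) - 6


def is_profitable(word: str) -> bool:
    """True only when submitting the word would add points to the total."""
    return word_score(word) > 0


if __name__ == "__main__":
    # Three letters lose points, six break even, seven earn one.
    for word in ("cat", "planet", "planets"):
        print(f"{word!r}: {word_score(word):+d}")
```

Under this formula a six-letter word scores zero, so seven letters is the shortest submission that actually gains anything.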

The Scoreboard Says Everything

Bot	Round 1	Round 2	Round 3	Total
Claude (Opus 4.6)	+258	+324	+272	+854
Gemini (Pro 3.1)	0	0	0	0
Grok (Expert 4.2)	0	-1,477	-43	-1,520
ChatGPT (GPT-5.3)	-24,283	-24,025	-26,075	-74,383

ChatGPT's bot submitted roughly 12,000 short words per round. Every single one cost points. Across three rounds, it racked up negative 74,383 points. That is not a typo.

One Model Did Basic Arithmetic

Claude's bot set a minimum submission length of seven letters, the shortest word that earns points under the scoring formula (a six-letter word breaks even at zero). It then went further: an iterative depth-first search, prefix pruning via binary search, three-thread pipelining to solve and send words simultaneously, and TCP_NODELAY to reduce network latency. The result was a comfortable positive score in every round.
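Claude's actual code is not reproduced here, but the search side of that approach might look roughly like the sketch below: an iterative depth-first search that abandons any path whose letters are not a prefix of some dictionary word, using binary search over a sorted word list. Several details are assumptions for illustration: diagonal neighbors count as adjacent, a tile cannot be reused within one word, and the names `has_prefix`, `is_word`, and `find_words` are hypothetical.

```python
from bisect import bisect_left

MIN_LEN = 7  # shortest length with a positive score under "letters minus six"


def has_prefix(sorted_words, prefix):
    """Binary search: does any word in the sorted list start with `prefix`?"""
    i = bisect_left(sorted_words, prefix)
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)


def is_word(sorted_words, candidate):
    """Binary search for an exact match in the sorted list."""
    i = bisect_left(sorted_words, candidate)
    return i < len(sorted_words) and sorted_words[i] == candidate


def find_words(grid, sorted_words, min_len=MIN_LEN):
    """Iterative DFS over adjacent tiles, pruning prefixes that lead nowhere."""
    rows, cols = len(grid), len(grid[0])
    found = set()
    for r0 in range(rows):
        for c0 in range(cols):
            # Stack entries: (row, col, letters so far, cells already used).
            stack = [(r0, c0, grid[r0][c0], frozenset([(r0, c0)]))]
            while stack:
                r, c, prefix, seen = stack.pop()
                if not has_prefix(sorted_words, prefix):
                    continue  # no dictionary word starts this way
                if len(prefix) >= min_len and is_word(sorted_words, prefix):
                    found.add(prefix)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if not (0 <= nr < rows and 0 <= nc < cols):
                            continue
                        if (nr, nc) in seen:
                            continue  # assumption: no tile reuse
                        stack.append(
                            (nr, nc, prefix + grid[nr][nc], seen | {(nr, nc)})
                        )
    return found
```

With a toy grid such as `["pla", "ten", "sxx"]` and a sorted word list containing `plan`, `planet`, `planets`, and `ten`, the default `min_len` keeps only `planets`. Lowering `min_len` to 3 would return all four, which is the behavior that sank the other bots.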

The other three models all made the same fundamental mistake. They confused the game's minimum valid word length (three letters) with the length needed to actually score points. None of them checked whether submitting a short word was profitable. Gemini's bot used a synchronous send-then-wait pattern that throttled its output so badly it effectively submitted nothing, which ironically made it the second-best performer at zero points. Grok finished third, with round scores that swung from 0 to -1,477.
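The alternative to send-then-wait is to decouple finding words from sending them. The sketch below is an assumption-laden illustration, not any bot's real code: it shows two pipeline stages (a solver and a sender) joined by a queue, while the article does not say what the winning bot's third thread did. The wire protocol is unknown, so sending is left as a caller-supplied `send` function, and `open_low_latency_socket` with its `host` and `port` parameters is a hypothetical helper that only demonstrates the standard `TCP_NODELAY` socket option.

```python
import queue
import socket
import threading

MIN_LEN = 7  # shortest length with a positive score under "letters minus six"
_DONE = object()  # sentinel marking the end of the word stream


def open_low_latency_socket(host, port):
    """Hypothetical helper: connect and disable Nagle's algorithm."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def run_pipeline(candidates, send, min_len=MIN_LEN):
    """Filter words on one thread while another thread sends them."""
    outbox = queue.Queue()

    def solver():
        # Only words that would earn points are queued for sending.
        for word in candidates:
            if len(word) >= min_len:
                outbox.put(word)
        outbox.put(_DONE)

    def sender():
        # Drains the queue without waiting on the solver to finish.
        while True:
            word = outbox.get()
            if word is _DONE:
                break
            send(word)

    threads = [threading.Thread(target=solver), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the sender never blocks the solver, a slow network round trip delays individual submissions rather than stalling the whole search.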

What This Actually Tests

This is not a general intelligence benchmark. It tests one specific thing: whether a model carefully reads a specification and adjusts its strategy accordingly. The scoring formula was right there in the prompt. Claude applied it. The others ignored it.

That distinction matters for real-world coding work. When you ask an AI to build something, the spec is everything. A model that skims past a penalty clause or misreads a business rule will produce code that technically runs but fails where it counts. ChatGPT's bot worked perfectly as a word finder. It just lost 74,000 points doing it.

The caveats are real: a small sample of three rounds, one task type, and no repeated trials with temperature variation. Take the rankings with appropriate skepticism. But the specification compliance gap is a pattern worth watching across more structured tests.
