Anthropic Study: Developers Who Used AI Assistants Scored 17% Lower on Coding Tests

Anthropic published a study in late January that keeps resurfacing in developer circles, and the finding is uncomfortable: software engineers who used an AI coding assistant to learn a new programming library scored 17 percentage points lower on a comprehension quiz than those who worked without one.

The headline number is striking, but the details underneath it are more interesting.

The Study Setup

Researchers Judy Hanwen Shen and Alex Tamkin ran a randomized controlled trial with 52 professional software engineers. All participants had at least a year of weekly Python experience, and 55% had seven or more years of coding under their belts. None had prior experience with Trio, an asynchronous Python library used as the learning target.

Participants were split into two groups. The treatment group got access to GPT-4o through a chat interface. The control group got web search and written documentation only. Both groups had 35 minutes to complete two Trio programming tasks, followed by a 14-question quiz covering debugging, code reading, and conceptual understanding.

The Results

The AI-assisted group averaged 16 out of 27 points on the quiz (about 50%). The control group averaged 20.1 points (about 67%). That gap - 4.15 points - was statistically significant with a p-value of 0.01. The researchers described it as "the equivalent of nearly two letter grades."
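A quick sanity check on how to read the headline figure. The sketch below uses only the rounded percentages quoted above (about 50% and about 67%) and shows that "17%" is the gap in percentage points; measured relative to the control group's score, the drop is larger. The function names are illustrative and not taken from the study.

```python
# Rounded quiz averages as quoted in the article (percent of total points).
AI_AVG = 50.0
CONTROL_AVG = 67.0


def gap_in_points(control: float, treated: float) -> float:
    """Absolute gap between two percentage scores, in percentage points."""
    return control - treated


def relative_drop(control: float, treated: float) -> float:
    """Gap expressed as a percent of the control group's score."""
    return (control - treated) / control * 100


print(f"percentage-point gap: {gap_in_points(CONTROL_AVG, AI_AVG):.1f}")  # 17.0
print(f"relative drop: {relative_drop(CONTROL_AVG, AI_AVG):.1f}%")        # 25.4%
```

Because the inputs are rounded, treat both outputs as approximate.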

The kicker: AI users only finished about two minutes faster on average. That speed difference wasn't even statistically significant. The overhead of writing prompts and interpreting responses ate into the time savings.

Debugging questions showed the widest gap between groups. Developers who had wrestled with Trio errors themselves could diagnose problems. Those who had an AI handle the errors couldn't.

Not All AI Use Was Equal

The most valuable part of the study is the six "interaction personas" the researchers identified. Three patterns led to poor quiz scores (below 40%):

  • AI Delegation: Fastest completion, lowest learning. These developers basically handed the task to GPT-4o.
  • Progressive AI Reliance: Started asking questions, gradually shifted to pure delegation.
  • Iterative AI Debugging: Used AI to fix errors without understanding them.

Three patterns led to strong scores (65-86%):

  • Generation-Then-Comprehension: Got AI-generated code, then asked follow-up questions about how it worked.
  • Hybrid Code-Explanation: Requested code with explanations attached.
  • Conceptual Inquiry: Only asked conceptual questions, solved errors independently. These developers were the second-fastest group overall.

The pattern is clear. Developers who used AI as a tutor scored in roughly the same range as those without AI. Developers who used AI as a coder scored below 40%, well under the control group's average.

What This Actually Means for Daily AI Use

This study measured learning a new library from scratch, not productivity on familiar codebases. That distinction matters. A separate Anthropic study found up to 80% time savings when developers already had the relevant skills. The problem isn't using AI on code you understand - it's using AI instead of understanding code.

The study also only tested a chat interface, not agentic tools like Cursor or Claude Code that write directly into your editor. The researchers note that agentic tools would likely have "more pronounced" effects on skill development, since they remove even the step of reading and pasting code.

The practical takeaway: if you're working in a domain you already know, AI coding tools are likely a productivity win. If you're learning something new, you need to actively resist the urge to delegate. Ask the AI to explain, not to solve. The 7 developers in the "Conceptual Inquiry" group suggest you can use AI and still learn - you just have to use it differently than instinct suggests.

As the researchers put it: "AI-enhanced productivity is not a shortcut to competence."