Adafruit Tests Show LLMs Reproduce Open-Source Code Verbatim at High Rates

53% verbatim overlap. That's how much of Adafruit's SSD1306 stats.py example script the Qwen2.5:14B model reproduced when prompted: not paraphrased, not inspired by, but the same characters in the same order.

Phillip Torrone, Adafruit's Managing Director, published results from 332 memorization probes run against open-weight large language models using Adafruit's own open-source code as test material. The findings paint a concrete picture of how AI models absorb and reproduce the code they were trained on.

The "Continue" Loophole

The most striking result involves how easily alignment guardrails crumble under a simple reframing. When asked to reproduce a file by name, the GPT-OSS:120B model refused about 60% of the time, producing trained refusal language about not being able to reproduce copyrighted content. But when given the first 40% of the same file and asked to "continue" writing it, the model complied every single time, producing identical output.

The distinction between "reproduce this" and "continue this" is real in how the model processes the request: one reads as a request to recite a named work, the other as ordinary text completion. But the result is the same. The model's weights contain the code, and a trivial prompt change extracts it completely.
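The two framings can be sketched as a pair of prompts built from the same file. This is a minimal illustration, not Torrone's actual harness: the function name `build_probes`, the prompt wording, and the line-based 40% split are all assumptions made for the example.

```python
from typing import Tuple


def build_probes(
    filename: str, source: str, prefix_fraction: float = 0.4
) -> Tuple[str, str, str]:
    """Build a by-name prompt, a continuation prompt, and the held-out remainder.

    Hypothetical helper: the prompt wording and the split on line
    boundaries are assumptions, not the published probe format.
    """
    lines = source.splitlines(keepends=True)
    cut = int(len(lines) * prefix_fraction)  # number of lines shown to the model
    prefix = "".join(lines[:cut])
    held_out = "".join(lines[cut:])  # text the model is NOT shown

    # Framing 1: ask for the file by name (the kind of request that may be refused).
    by_name_prompt = f"Reproduce the file {filename} exactly."

    # Framing 2: show the opening portion and ask the model to keep going.
    continue_prompt = f"Continue writing this file:\n\n{prefix}"

    return by_name_prompt, continue_prompt, held_out
```

A probe run would send each prompt to the model, then compare the reply against the original file (for the by-name prompt) or against `held_out` (for the continuation prompt).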

Code Gets Memorized Far More Than Text

Torrone's probes found that code memorization rates run 10 to 100 times higher than text memorization. Textbook content shows around 0.15% memorization overlap. Code, especially popular, widely forked beginner tutorials and example scripts, reaches dramatically higher levels.

This makes intuitive sense. A textbook passage can be expressed thousands of ways. A Python function that initializes an I2C display has far fewer valid variations, especially when the entire community converges on the same example from the same library.
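One way to put a number on "verbatim overlap" is to measure how much of a reference file is covered by long runs of characters that also appear, in order, in the model's output. The sketch below uses Python's standard `difflib`. The `min_run` threshold and the choice of the reference length as the denominator are assumptions for illustration; the article does not say which metric Torrone's probes used.

```python
from difflib import SequenceMatcher


def verbatim_overlap(reference: str, output: str, min_run: int = 20) -> float:
    """Fraction of `reference` covered by verbatim runs found in `output`.

    Only matching runs of at least `min_run` characters are counted, so
    short coincidental matches (a keyword, a bracket) do not inflate the
    score. Both the threshold and the metric itself are illustrative
    assumptions, not the published methodology.
    """
    if not reference:
        return 0.0
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences, which would distort the count.
    matcher = SequenceMatcher(None, reference, output, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run
    )
    return covered / len(reference)
```

Under this definition a score of 0.53 would mean that 53% of the reference file's characters reappear in the output as part of long, in-order identical runs. A lower `min_run` is more permissive, and a higher one counts only extended passages.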

Torrone frames this through Borges's short story "Pierre Menard, Author of the Quixote," in which a twentieth-century writer sets out to compose passages of Don Quixote that match Cervantes word for word, without copying them. The parallel to LLMs is sharp: the model doesn't "copy" in any traditional sense, but the output is identical to the input it was trained on. As Torrone puts it: "The reward for writing something so good that everyone uses it is having it absorbed and reproduced without your name attached."

What This Means for Open-Source Authors

The findings highlight an uncomfortable asymmetry. The most useful, most adopted open-source code gets absorbed most completely into model weights. Legal distinctions between "text" and "code" in training data currently favor model developers over code authors, and the continuation framing sidesteps both alignment refusals and legal scrutiny.

This isn't a theoretical concern. If you maintain a popular open-source library, pieces of your code almost certainly live inside multiple commercial LLMs, served back to users without attribution. Torrone's contribution here is putting specific numbers on the problem using his own company's code as evidence, rather than arguing from hypotheticals.

For anyone using AI coding assistants daily, it's a useful reminder: the code your assistant suggests may not be as "generated" as it appears. Some of it is closer to recalled than created.