20%. That's how often AI-generated security patches pass every automated test in your CI/CD pipeline (the system that automatically builds, tests, and deploys code) and still break your application in production.
A new benchmark called SCA RemBench tested three leading AI coding tools across 25 open-source repositories with known security vulnerabilities. The task was straightforward: patch the vulnerability without breaking anything else. All three tools produced patches that looked correct, compiled fine, and passed automated test suites. But when those patches hit real-world conditions, one in five failed silently.
The Scores Tell a Partial Story
RemBench scored each tool on a 100-point scale weighted across three dimensions: compatibility (50%), correctness (30%), and precision (20%). The results:
- Gemini 3 Pro: 82.1
- Claude Code (Opus 4.6): 79.3
- Codex (GPT-5.2): 76.3
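To see how a composite like this can mask a weak dimension, here is a minimal sketch of the weighting. It assumes the three dimensions are combined as a simple weighted sum, which the published weights suggest but which is not confirmed here. The sub-scores are invented for illustration and are not benchmark data.

```python
# Weights as reported for RemBench: compatibility 50%, correctness 30%, precision 20%.
WEIGHTS = {"compatibility": 0.5, "correctness": 0.3, "precision": 0.2}


def weighted_score(subscores):
    """Combine per-dimension scores (each 0-100) into one 100-point score.

    Assumption: a plain weighted sum. The benchmark's exact formula may differ.
    """
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical sub-scores, made up for this example only.
example = {"compatibility": 70.0, "correctness": 90.0, "precision": 95.0}

print(round(weighted_score(example), 1))
```

Under these made-up numbers the total lands at 81, even though the compatibility dimension, the one that carries half the weight, sits at only 70.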
Those scores look decent in isolation. The problem is what sits behind them in a security context: roughly one in five patches introduced a new defect while fixing the original vulnerability.
How Green Tests Hide Broken Code
The failure patterns fall into three categories, one for each scoring dimension, and the most common one slips through without raising a single error.
Compatibility failures accounted for the largest share. In one case, an Express.js patch removed utility functions and replaced them with null stubs. Tests passed because nothing explicitly checked those utilities. But in production, query-parameter parsing silently stopped working. Users wouldn't see an error message. They'd just get wrong results.
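The stubbed-utility pattern is easy to reproduce in miniature. The sketch below is a Python analogue, not the actual Express.js patch: every function name is hypothetical, and the stub returns an empty dict rather than null. It shows why a test that only checks "something came back" stays green while a test that pins the observable behavior does not.

```python
from urllib.parse import parse_qs


def parse_query_original(raw):
    """Pre-patch helper (hypothetical): actually parses the query string."""
    return parse_qs(raw)


def parse_query_patched(raw):
    """Post-patch stub (hypothetical): same name and signature, no behavior."""
    return {}


def handle_request(raw_query, parse_query):
    """Toy handler: reads the page number from the query, defaulting to "1"."""
    params = parse_query(raw_query)
    return params.get("page", ["1"])[0]


def smoke_test(parse_query):
    """The kind of test that stays green: it only checks that a string comes back."""
    return isinstance(handle_request("page=3", parse_query), str)


def behavior_test(parse_query):
    """A test that pins the behavior callers actually rely on."""
    return handle_request("page=3", parse_query) == "3"


for name, impl in [("original", parse_query_original), ("patched", parse_query_patched)]:
    print(name, "smoke:", smoke_test(impl), "behavior:", behavior_test(impl))
```

With the stub in place, the handler quietly falls back to page 1 for every request: no exception, no log line, just the wrong result.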
Correctness failures came next. A patch for the Go networking library quic-go used the wrong error sentinel from a different package. The code compiled and tests passed, but error-matching logic failed at runtime, meaning the application couldn't properly handle certain network conditions.
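The same class of bug can be sketched in Python, with exception classes standing in for Go error sentinels. This is an analogue under stated assumptions, not the quic-go code: both exception types and all function names are hypothetical. The handler names a plausible-looking sentinel that the transport never raises, so the error-handling branch is dead code.

```python
class TransportClosed(Exception):
    """Sentinel the (hypothetical) transport layer actually raises."""


class LegacyClosed(Exception):
    """Similarly named sentinel from a different (hypothetical) package."""


def read_packet():
    """Simulates the network condition the handler is supposed to cover."""
    raise TransportClosed("connection closed by peer")


def read_with_wrong_sentinel():
    try:
        read_packet()
    except LegacyClosed:  # wrong sentinel: this branch can never match
        return "handled"
    return "ok"


def read_with_right_sentinel():
    try:
        read_packet()
    except TransportClosed:
        return "handled"
    return "ok"


def outcome(reader):
    """Run a reader and report whether the close was handled or escaped."""
    try:
        return reader()
    except Exception as exc:
        return "unhandled " + type(exc).__name__


print(outcome(read_with_wrong_sentinel))
print(outcome(read_with_right_sentinel))
```

Both versions import and run cleanly on the happy path, which is why a suite that never simulates the closed connection cannot tell them apart.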
Precision failures were the subtlest. A urllib3 patch created an unnecessary wrapper subclass that technically worked but hid new functionality the security update was supposed to expose.
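A wrapper that "technically works" while hiding new functionality might look like the sketch below. The classes and the max_bytes parameter are hypothetical stand-ins chosen for illustration; they are not urllib3's API or the patch from the benchmark.

```python
from typing import Optional


class Response:
    """Stand-in for a library class after a security update (hypothetical)."""

    def __init__(self, body: bytes):
        self._body = body

    def read(self, max_bytes: Optional[int] = None) -> bytes:
        # Hypothetical new parameter: the update lets callers cap how much is read.
        return self._body if max_bytes is None else self._body[:max_bytes]


class CompatResponse(Response):
    """Unnecessary wrapper: restores the old signature and hides max_bytes."""

    def read(self) -> bytes:
        return super().read()


def supports_limit(resp) -> bool:
    """Check whether a caller can still reach the new size cap."""
    try:
        resp.read(max_bytes=3)
        return True
    except TypeError:
        return False


print(CompatResponse(b"abcdef").read())          # existing callers keep working
print(supports_limit(Response(b"abcdef")))       # the update exposes the cap
print(supports_limit(CompatResponse(b"abcdef"))) # the wrapper hides it
```

Existing tests that call read() with no arguments pass against the wrapper, so nothing flags that the protection the update added is now unreachable.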
The common thread: every model treated vulnerability patching as a text-completion problem rather than a migration problem. They looked at the current code, saw what needed to change, and changed it. What they missed was the gap between the library's old API and its new secure version, including all the downstream code that depended on specific behaviors.
What Actually Helps
The benchmark found that adding a structured planning step before code generation improved scores by 7.4 points (to 89.5). In practice, that step forces the AI to analyze what changed between library versions, identify the affected call sites, and map out a remediation plan before it writes any code.
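One piece of that planning step, identifying affected call sites, can be approximated with a few lines of static analysis. The sketch below uses Python's standard ast module and is not how the benchmark implemented planning. The sample source, the client module, and the list of changed names are all hypothetical, and matching by bare name will produce false positives in a real codebase.

```python
import ast


def find_call_sites(source, changed_names):
    """Return sorted (line, name) pairs for calls to any name in changed_names."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # foo(...) is an ast.Name; obj.foo(...) is an ast.Attribute.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in changed_names:
            hits.append((node.lineno, name))
    return sorted(hits)


# Hypothetical application code that depends on a library being upgraded.
SAMPLE = '''
import client

def fetch(url):
    conn = client.connect(url)
    return conn.read_all()

def ping(url):
    return client.connect(url).status()
'''

# Hypothetical list of functions whose behavior changed between versions.
for line, name in find_call_sites(SAMPLE, {"connect", "read_all"}):
    print("line", line, "calls", name)
```

The output is a checklist rather than a fix: each listed call site is a place where the old and new library behavior need to be compared before any patch is written.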
This tracks with what most experienced developers already know: the hard part of a security patch isn't writing the fix. It's understanding the full blast radius of the change.
For teams relying on AI tools like Cursor, Claude Code, or GitHub Copilot for security patches, the practical takeaway is blunt: a green CI pipeline is not proof that an AI-generated patch is safe. These tools are useful for drafting fixes, but the patches need human review that goes beyond "do the tests pass" and into "what behavior changed that the tests don't cover." Automated test suites were never designed to catch the kind of semantic drift that AI patching introduces, and until that changes, the 20% failure rate is the floor, not the ceiling.