
AI Auditor Scores 0 Real Findings on Ethereum Smart Contract Security Test

March 9, 2026

$5.66 and four false positives. That's what MAGIC Grants got when they ran Zellic's V12 - billed as "the only autonomous Solidity auditor that finds critical bugs" - against a real Ethereum smart contract that had already passed a human security review by Trail of Bits.

V12 reported four vulnerabilities. It then invalidated two of them on its own (one critical, one low). The two flags that remained were both wrong.

The Fix That Would Have Created a Vulnerability

The most telling result: V12 flagged a 2,300-gas stipend restriction as a critical issue, claiming that ETH transfers to contract recipients would fail. Its suggested fix was to forward gas() instead, meaning all remaining gas, which would have removed the gas limit entirely. That "fix" would open the door to a reentrancy vulnerability (a type of attack where a malicious contract calls back into your contract repeatedly to drain funds). The AI identified a security pattern, misunderstood why it existed, and recommended removing it.
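To make the mechanism concrete, here is a toy Python simulation of why a gas cap on the callback matters. It is a sketch under loud assumptions, not the audited contract: ToyVault, Attacker, simulate, and the REENTER_COST figure are all hypothetical, and the vault is deliberately written to update its books after paying out, which may not match the real code.

```python
STIPEND = 2_300        # gas forwarded by a stipend-limited ETH transfer
REENTER_COST = 10_000  # hypothetical gas cost of calling back into the vault


class ToyVault:
    """Toy model of a contract that pays out before updating its books."""

    def __init__(self):
        self.balances = {}
        self.funds = 0

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.funds += amount

    def withdraw(self, account, callback_gas):
        amount = self.balances.get(account, 0)
        if amount == 0 or self.funds < amount:
            return
        self.funds -= amount
        # External call happens first; the recipient's code runs here.
        account.receive(self, amount, callback_gas)
        # Bookkeeping happens second, which is what reentrancy exploits.
        self.balances[account] = 0


class Attacker:
    """Recipient that re-enters withdraw() whenever it has enough gas."""

    def __init__(self):
        self.received = 0

    def receive(self, vault, amount, gas):
        self.received += amount
        if gas >= REENTER_COST and vault.funds >= amount:
            vault.withdraw(self, gas - REENTER_COST)


def simulate(callback_gas):
    """Return (attacker's take, funds left in the vault)."""
    vault = ToyVault()
    attacker = Attacker()
    vault.deposit("honest users", 90)
    vault.deposit(attacker, 10)
    vault.withdraw(attacker, callback_gas)
    return attacker.received, vault.funds
```

With simulate(STIPEND) the attacker gets back only its own 10 units, because the callback cannot afford to re-enter. With simulate(100_000), standing in for "forward all gas", the same attacker walks away with the entire 100. The gas numbers are illustrative only; the point is that the cap was doing security work.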

The second remaining flag claimed deposit amounts weren't properly reconciled with received tokens. The code's own documentation explicitly stated that non-standard ERC20 tokens are intentionally unsupported. V12 didn't read the docs.
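For readers unfamiliar with this class of finding, the sketch below shows what "reconciling" a deposit usually means. It assumes the non-standard behavior in question is something like a fee-on-transfer token, which the source does not specify. FeeToken, deposit_trusting, and deposit_reconciled are hypothetical names, and none of this is the audited contract's code.

```python
class FeeToken:
    """Hypothetical token that burns a percentage fee on every transfer."""

    def __init__(self, fee_percent):
        self.fee_percent = fee_percent
        self.balances = {}

    def mint(self, owner, amount):
        self.balances[owner] = self.balances.get(owner, 0) + amount

    def transfer_from(self, sender, recipient, amount):
        fee = amount * self.fee_percent // 100
        self.balances[sender] -= amount
        # The recipient gets less than `amount` whenever the fee is non-zero.
        self.balances[recipient] = self.balances.get(recipient, 0) + amount - fee


def deposit_trusting(token, vault, user, amount):
    """Credit the requested amount, trusting the token to deliver all of it."""
    token.transfer_from(user, vault, amount)
    return amount


def deposit_reconciled(token, vault, user, amount):
    """Credit only what actually arrived, measured by balance difference."""
    before = token.balances.get(vault, 0)
    token.transfer_from(user, vault, amount)
    return token.balances[vault] - before
```

With a standard token (fee_percent=0) both functions credit the same amount, so the simpler version is correct. They only diverge for tokens that take a cut in transit, which is exactly the category the contract's documentation ruled out. Whether to support such tokens is a design decision, not a bug.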

A 70.6% Benchmark Score Means Nothing Here

V12 scores 70.6% on EVMBench, a benchmark created by OpenAI and Paradigm for evaluating AI agents on smart contract vulnerability detection. That number sounds respectable until you see results on real production code. Benchmarks test known vulnerability patterns in isolated scenarios. Real contracts have intentional design tradeoffs, documented edge cases, and context that benchmarks don't capture.

This gap between benchmark performance and real-world usefulness is one of the most persistent problems in applied AI. A model can ace the test and still be useless - or dangerous - in practice.

Where This Leaves AI-Assisted Auditing

None of this means AI will never be useful for security auditing. But right now, the failure mode is exactly wrong for this use case. A security tool that generates false positives is annoying. A security tool that recommends introducing new vulnerabilities is actively harmful.

The MAGIC Grants test was small - one contract, one tool. But the results match what security researchers have been saying: current AI models can pattern-match known vulnerability types but lack the contextual reasoning to understand why code is written a certain way. Until that changes, AI auditors are best used as a first-pass filter that always gets human review, not as autonomous security agents making real decisions about production contracts holding real money.
