Every AI coding assistant on the market today shares the same blind spot: it can write code that passes your tests, but it cannot prove that the code is correct.
That distinction sounds academic until you hit a bug that only surfaces under rare conditions in production. A new analysis from Predictable Machines walks through concrete examples of how code that looks right, and that AI would happily approve, breaks down when you examine all possible inputs, not just the ones in your test suite.
The $10 Withdrawal That Goes Negative
The clearest example is a Java banking app with a withdrawal method. The code checks whether the account has sufficient funds before allowing a withdrawal. Straightforward, and any AI assistant would call it correct. But the method doesn't account for fees charged after the transaction. Withdraw $10 from a $10 balance, get hit with a $2 fee, and the account drops to negative $2. The "balance can never go negative" rule is violated, but no individual test case would catch it unless someone specifically thought to test that exact scenario.
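To make the failure concrete, here is a minimal sketch of what such a withdrawal method might look like. It is not the code from the Predictable Machines analysis: the class and method names, the flat $2 fee, and the choice to store money as cents in a `long` are all assumptions made for illustration.

```java
// Hypothetical reconstruction of the withdrawal example; every name here is an assumption.
class FeeAccount {
    static final long FEE_CENTS = 200; // the $2 fee from the example, in cents

    private long balanceCents;

    FeeAccount(long openingBalanceCents) {
        this.balanceCents = openingBalanceCents;
    }

    long balanceCents() {
        return balanceCents;
    }

    // Buggy version: the guard looks only at the amount, then the fee is charged anyway.
    boolean withdrawUnsafe(long amountCents) {
        if (amountCents <= 0 || amountCents > balanceCents) {
            return false;
        }
        balanceCents -= amountCents;
        balanceCents -= FEE_CENTS; // can push the balance below zero
        return true;
    }

    // Guarded version: reject unless the balance covers the amount and the fee.
    boolean withdrawChecked(long amountCents) {
        if (amountCents <= 0 || amountCents > balanceCents) {
            return false;
        }
        if (balanceCents - amountCents < FEE_CENTS) {
            return false; // not enough left over to pay the fee
        }
        balanceCents -= amountCents + FEE_CENTS;
        return true;
    }
}

class WithdrawalDemo {
    public static void main(String[] args) {
        FeeAccount buggy = new FeeAccount(1000);      // $10.00
        buggy.withdrawUnsafe(1000);                   // withdraw $10.00
        System.out.println(buggy.balanceCents());     // prints -200

        FeeAccount guarded = new FeeAccount(1000);
        System.out.println(guarded.withdrawChecked(1000)); // prints false
        System.out.println(guarded.balanceCents());        // prints 1000
    }
}
```

A test suite that only withdraws amounts well below the balance passes against both versions, which is why the buggy guard looks fine in review.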
Scale this up to a transfer method where fees compound across operations, and you get money disappearing from the system entirely. Transfer $10 from an $11 account with a $2 fee and the total money in the system shrinks by $3. Another example: add 1 to Integer.MAX_VALUE in Java and the result silently wraps around to Integer.MIN_VALUE, a negative number, with no error thrown.
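The integer overflow is easy to reproduce. The sketch below covers only that part of the paragraph, not the transfer scenario. The class and method names are made up for illustration, while `Math.addExact` is a standard library method that throws `ArithmeticException` instead of wrapping.

```java
// Illustrative names; only Integer.MAX_VALUE and Math.addExact come from the JDK.
class OverflowDemo {
    // Plain int addition: overflow wraps around silently (two's complement).
    static int addWrapping(int a, int b) {
        return a + b;
    }

    // Math.addExact throws ArithmeticException when the result does not fit in an int.
    static int addChecked(int a, int b) {
        return Math.addExact(a, b);
    }

    public static void main(String[] args) {
        System.out.println(addWrapping(Integer.MAX_VALUE, 1)); // prints -2147483648

        try {
            addChecked(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

Checked arithmetic turns the silent wrap into a visible failure, but it still only fires on inputs that are actually exercised at runtime.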
These aren't exotic edge cases. They're the kind of bugs that ship to production because they live in the gaps between unit tests.
What Formal Verification Actually Does
Formal verification is a mathematical approach that checks code against every possible input, not just the ones a developer (or an AI) thought to test. Instead of running the code with sample data, it uses mathematical proofs to guarantee properties hold universally. When it finds a violation, it produces a concrete counterexample.
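Plain Java cannot express a proof, but a rough feel for "check every input and report a counterexample" can be had from a bounded exhaustive search. The sketch below is a stand-in technique, not formal verification: it enumerates every balance and amount up to a small bound, whereas a real verifier reasons symbolically about all values. The withdrawal rule, the whole-dollar `int` amounts, and all names are assumptions for illustration.

```java
import java.util.Optional;

// Bounded exhaustive checking, used here as a simplified stand-in for formal verification.
class CounterexampleSearch {
    static final int FEE = 2; // assumed flat fee, in whole dollars

    // Hypothetical buggy withdrawal as a pure function: returns the new balance.
    static int withdraw(int balance, int amount) {
        if (amount <= 0 || amount > balance) {
            return balance; // rejected, balance unchanged
        }
        return balance - amount - FEE; // the fee is not covered by the guard
    }

    // Tries every (balance, amount) pair with both values in [0, bound] and
    // returns the first pair that breaks "balance never goes negative".
    static Optional<int[]> findViolation(int bound) {
        for (int balance = 0; balance <= bound; balance++) {
            for (int amount = 0; amount <= bound; amount++) {
                if (withdraw(balance, amount) < 0) {
                    return Optional.of(new int[] {balance, amount});
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<int[]> violation = findViolation(20);
        if (violation.isPresent()) {
            int[] pair = violation.get();
            // prints: counterexample: balance=1, amount=1
            System.out.println("counterexample: balance=" + pair[0] + ", amount=" + pair[1]);
        } else {
            System.out.println("no violation found within the bound");
        }
    }
}
```

An empty result here would only mean that no violation exists within the bound. A proof removes that qualifier, which is the difference the article is pointing at.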
The tradeoff is cost. Formal proofs have historically required 5 to 20 times more effort than the code they verify. That overhead is often described by analogy with the de Bruijn factor, a term that originally measures how much larger a formalized mathematical proof is than its informal version. That's why formal verification has mostly been confined to aerospace, medical devices, and financial infrastructure, where bugs have catastrophic consequences.
Companies like Predictable Machines (who published this analysis) are building tools to lower that cost using automation across Java, Kotlin, Python, and C++. The pitch is that AI handles the easy verification while humans focus on the hard proofs.
Where AI Assistants Actually Fall Short
The real limitation isn't that AI writes buggy code. It's that AI operates within a context window and evaluates code locally. An LLM can spot problems in the function it's looking at, but it struggles with global system properties that span multiple components. Does the total money in the system stay constant across all transaction types? Does this integer field stay positive through every code path? Those are the questions formal verification can answer and AI currently cannot.
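A global property such as "the total money in the system stays constant" can at least be written down as an executable invariant. The sketch below is one hypothetical way to do that: the `Ledger` class, the fee account, and the $2 fee in cents are assumptions, not the design from the analysis. The invariant spans every account plus the collected fees, so it cannot be judged by reading the transfer logic alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical ledger; fees move into a fee bucket so that no money leaves the system.
class Ledger {
    static final long FEE_CENTS = 200; // assumed flat $2 fee

    private final Map<String, Long> balances = new HashMap<>();
    private long collectedFeesCents = 0;

    void open(String id, long openingCents) {
        balances.put(id, openingCents);
    }

    long balanceOf(String id) {
        return balances.get(id);
    }

    // Global quantity: every account balance plus all fees collected so far.
    long systemTotalCents() {
        long sum = collectedFeesCents;
        for (long balance : balances.values()) {
            sum += balance;
        }
        return sum;
    }

    boolean transfer(String from, String to, long amountCents) {
        if (!balances.containsKey(from) || !balances.containsKey(to)) {
            return false;
        }
        long before = systemTotalCents();
        long source = balances.get(from);
        // The source must cover both the amount and the fee.
        if (amountCents <= 0 || amountCents > source || source - amountCents < FEE_CENTS) {
            return false;
        }
        balances.put(from, source - amountCents - FEE_CENTS);
        balances.put(to, balances.get(to) + amountCents);
        collectedFeesCents += FEE_CENTS;
        // Runtime check of the conservation invariant for this one execution.
        if (systemTotalCents() != before) {
            throw new IllegalStateException("conservation of money violated");
        }
        return true;
    }
}

class LedgerDemo {
    public static void main(String[] args) {
        Ledger ledger = new Ledger();
        ledger.open("A", 1100); // $11.00
        ledger.open("B", 500);  // $5.00
        System.out.println(ledger.transfer("A", "B", 1000)); // prints false
        System.out.println(ledger.transfer("A", "B", 900));  // prints true
        System.out.println(ledger.systemTotalCents());       // prints 1600
    }
}
```

The runtime check only covers executions that actually happen. Showing that it can never fire, for every sequence of operations, is the job the article assigns to formal verification.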
For most daily coding work with tools like Cursor, Claude Code, or GitHub Copilot, this gap won't matter. But for anyone building payment systems, medical software, or anything where a subtle bug means real-world harm, knowing that your AI assistant "thinks" the code is correct is not the same as knowing it actually is.