On SWE-bench Verified, the top six coding models are separated by just 1.3 percentage points. They all hover around 80%. On SWE-rebench, which uses fresh, unpublished problems that no model could have trained on, the gap blows out to nearly six points. That difference tells you something important about how some models got their scores.
SWE-rebench exists because of a well-known problem with public benchmarks: if the test questions are public, labs can (intentionally or not) train their models on the answers. SWE-rebench fixes this by pulling real GitHub issues filed between January 2 and March 1, 2026, across 46 repositories. The February update also dropped the old 80-step execution limit in favor of a 128k context window (roughly a 300-page book), and removed pre-built demonstrations that could give models a head start.
The Actual Numbers
Here's the top 10, ranked by resolved rate (percentage of real-world coding problems the model successfully fixes on the first try):
| Rank | Model | Resolved | Pass@5 | Cost/Problem |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% | 70.2% | $1.12 |
| 2 | GPT-5.2 (medium) | 64.4% | 73.7% | $0.62 |
| 3 | GLM-5 | 62.8% | 70.2% | $0.76 |
| 4 | GPT-5.4 (medium) | 62.8% | 70.2% | $0.63 |
| 5 | Gemini 3.1 Pro | 62.3% | 75.4% | $0.66 |
| 6 | DeepSeek-V3.2 | 60.9% | 73.7% | $0.75 |
| 7 | Claude Sonnet 4.6 | 60.7% | 70.2% | $1.02 |
| 8 | Claude Sonnet 4.5 | 60.0% | 69.6% | $1.18 |
| 9 | Qwen3.5 (397B-A17B) | 59.9% | 71.9% | $1.18 |
| 10 | Step-3.5-Flash | 59.6% | 71.9% | $0.14 |
Pass@5 means the model gets five attempts at each problem and the problem counts as solved if any one of them succeeds. Cost per problem is the API spend for a single attempt.
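To make the two metrics concrete, here is a minimal sketch of how they could be computed from raw attempt logs. It assumes the definitions used in this article (resolved rate is the first attempt only, pass@5 is any of five); the leaderboard's own scoring code may differ, and the `outcomes` data below is made up purely for illustration.

```python
from typing import Sequence


def resolved_rate(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved on the first attempt."""
    return sum(1 for a in attempts if a[0]) / len(attempts)


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of problems where at least one of the first k attempts succeeded."""
    return sum(1 for a in attempts if any(a[:k])) / len(attempts)


# Hypothetical outcomes: four problems, five attempts each (True = fixed).
outcomes = [
    [True, True, False, True, True],
    [False, False, True, False, False],
    [False, False, False, False, False],
    [True, False, True, True, False],
]

print(resolved_rate(outcomes))  # 0.5
print(pass_at_k(outcomes, 5))   # 0.75
```

The second problem is the interesting one: it fails on the first try but succeeds on the third, so it counts toward pass@5 and not toward the resolved rate.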
The Contamination Signal
Compare these numbers to each model's SWE-bench Verified scores. On the public benchmark, Claude Opus 4.6 scores 80.8% and Gemini 3.1 Pro scores 80.6%, a gap of 0.2 points. On SWE-rebench, Opus leads Gemini by 3 full points (65.3% vs 62.3%). Put another way, Opus falls 15.5 points between the two benchmarks while Gemini falls 18.3. Some models drop more than others when the test can't be gamed.
Qwen3.5, which uses a mixture-of-experts architecture with 397 billion total parameters but only 17 billion active at once, lands at 59.9% here versus the mid-70s on public benchmarks. Step-3.5-Flash shows a similar pattern. That doesn't prove deliberate benchmark contamination, but it strongly suggests these models have seen SWE-bench problems during training, whether through direct inclusion or through the many public repositories that discuss and reproduce those exact problems.
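The Opus and Gemini comparison can be reproduced with a few lines of arithmetic. This is only a sketch using the four scores quoted in this section; the dictionary names are illustrative, and a larger drop is a signal worth investigating, not a measurement of contamination.

```python
# Scores quoted in the article, in percent.
verified = {"Claude Opus 4.6": 80.8, "Gemini 3.1 Pro": 80.6}
rebench = {"Claude Opus 4.6": 65.3, "Gemini 3.1 Pro": 62.3}

# Points lost when moving from the public benchmark to the fresh one.
drops = {model: round(verified[model] - rebench[model], 1) for model in verified}

# Gap between the two models on each benchmark.
gap_verified = round(verified["Claude Opus 4.6"] - verified["Gemini 3.1 Pro"], 1)
gap_rebench = round(rebench["Claude Opus 4.6"] - rebench["Gemini 3.1 Pro"], 1)

for model, drop in drops.items():
    print(f"{model}: -{drop} points")
print(f"Gap on Verified: {gap_verified}, gap on SWE-rebench: {gap_rebench}")
```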
The Value Play Nobody's Talking About
Step-3.5-Flash at $0.14 per problem deserves attention. It solves problems at nearly the same rate as Qwen3.5 (59.6% vs 59.9%) at roughly one-eighth the cost ($0.14 vs $1.18 per problem). For teams running coding agents on repetitive tasks like bug triage, test generation, or boilerplate migration, that cost difference compounds fast.
GPT-5.4 is another efficiency story. Despite being OpenAI's newest model, it uses fewer tokens per problem than any other top-5 entry while matching GLM-5's resolved rate at 62.8%. At $0.63 per problem it is among the cheapest of the premium models, although the table shows GPT-5.2 (medium) resolving more problems (64.4%) for a cent less ($0.62), so on capability per dollar the older model still edges it out.
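One rough way to compare value across the whole table is to divide cost per attempt by the resolved rate, which gives the expected spend per successfully fixed problem when each problem gets a single attempt. The sketch below uses only the table's figures; the `cost_per_fix` metric is a simplification introduced here, not something the leaderboard reports.

```python
# (model, resolved %, cost per attempt in USD), copied from the table above.
LEADERBOARD = [
    ("Claude Opus 4.6", 65.3, 1.12),
    ("GPT-5.2 (medium)", 64.4, 0.62),
    ("GLM-5", 62.8, 0.76),
    ("GPT-5.4 (medium)", 62.8, 0.63),
    ("Gemini 3.1 Pro", 62.3, 0.66),
    ("DeepSeek-V3.2", 60.9, 0.75),
    ("Claude Sonnet 4.6", 60.7, 1.02),
    ("Claude Sonnet 4.5", 60.0, 1.18),
    ("Qwen3.5 (397B-A17B)", 59.9, 1.18),
    ("Step-3.5-Flash", 59.6, 0.14),
]


def cost_per_fix(resolved_pct: float, cost_per_attempt: float) -> float:
    """Expected spend per resolved problem, assuming one attempt per problem."""
    return cost_per_attempt / (resolved_pct / 100)


# Cheapest expected cost per fix first.
ranked = sorted(LEADERBOARD, key=lambda row: cost_per_fix(row[1], row[2]))

for model, resolved, cost in ranked:
    print(f"{model:<22} ${cost_per_fix(resolved, cost):.2f} per resolved problem")
```

By this measure Step-3.5-Flash comes out far ahead, followed by the two GPT entries. It says nothing about which problems each model solves, so a cheaper model may still miss the harder fixes.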
Gemini 3.1 Pro has the highest pass@5 score at 75.4%, meaning when given multiple attempts, it finds the right fix more often than any other model. That's a useful property for agentic coding tools that can run multiple solution attempts in parallel.
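How much a model gains from retries can be read off the table as the difference between pass@5 and the single-attempt resolved rate. The following sketch computes that uplift for each model; the `uplift` name is just a label chosen here.

```python
# (model, resolved %, pass@5 %), copied from the table above.
SCORES = [
    ("Claude Opus 4.6", 65.3, 70.2),
    ("GPT-5.2 (medium)", 64.4, 73.7),
    ("GLM-5", 62.8, 70.2),
    ("GPT-5.4 (medium)", 62.8, 70.2),
    ("Gemini 3.1 Pro", 62.3, 75.4),
    ("DeepSeek-V3.2", 60.9, 73.7),
    ("Claude Sonnet 4.6", 60.7, 70.2),
    ("Claude Sonnet 4.5", 60.0, 69.6),
    ("Qwen3.5 (397B-A17B)", 59.9, 71.9),
    ("Step-3.5-Flash", 59.6, 71.9),
]

# Percentage points gained by allowing five attempts instead of one.
uplift = {model: round(pass5 - resolved, 1) for model, resolved, pass5 in SCORES}

most = max(uplift, key=uplift.get)
least = min(uplift, key=uplift.get)
print(f"Largest gain from retries:  {most} (+{uplift[most]} points)")
print(f"Smallest gain from retries: {least} (+{uplift[least]} points)")
```

Gemini 3.1 Pro gains the most from extra attempts and Claude Opus 4.6 the least, which fits the picture of Opus being the more consistent single-shot model and Gemini the better candidate for parallel sampling.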
The practical takeaway for anyone choosing a coding model: SWE-bench Verified scores are now too compressed to be useful for comparison. When every top model scores between 79% and 81% on the same public test, those numbers stop telling you anything meaningful. SWE-rebench, by testing on problems the models couldn't have memorized, reveals which ones are actually better at reasoning through unfamiliar code.