On February 18, 2026, OpenAI and Paradigm released EVMbench, an open benchmark evaluating whether AI agents can detect, patch, and exploit high-severity smart contract vulnerabilities.
In a sandboxed setting, the best AI agent successfully drains funds 72.2% of the time. Once an agent enters the right environment with a clear objective, the final mile of exploitation executes reliably.
The Scoreboard That Matters
EVMbench evaluates models across three modes:
- Detect (120 vulnerabilities): Top score of 45.6% (Claude Opus 4.6).
- Patch (45 vulnerabilities): Top score of 41.5% (GPT‑5.3‑Codex).
- Exploit (24 vulnerabilities): Top score of 72.2% (GPT‑5.3‑Codex).
Exploitation capabilities currently outpace detection and remediation.
What EVMbench Measures
EVMbench functions as an end-to-end capability test built for measurable execution. Key design choices include:
- Audit-Grade Data: Vulnerabilities are sourced from public competitions and real-world Tempo blockchain auditing scenarios.
- Deterministic Grading: A Rust-based harness deploys contracts and grades exploits programmatically via onchain analysis.
- Local Environments: Exploit tasks execute in a local Anvil environment.
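The grading approach described above can be sketched in miniature. The actual EVMbench harness is Rust-based and grades against a local Anvil node; the Python below is only an illustration of the core idea (deploy on a clean state, run the candidate exploit, grade purely from resulting balances), and every name in it is hypothetical:

```python
# Toy sketch of deterministic exploit grading: deploy a fresh state,
# run the candidate exploit, and grade purely from resulting balances.
# All names are hypothetical; the real EVMbench harness is Rust-based
# and evaluates exploits against a local Anvil node.

from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class ChainState:
    balances: Dict[str, int] = field(default_factory=dict)

def grade_exploit(
    deploy: Callable[[], ChainState],
    exploit: Callable[[ChainState], None],
    victim: str,
    drain_threshold: float = 0.9,
) -> bool:
    """Deploy on a clean state, run the exploit, grade from balances alone."""
    state = deploy()            # fresh deployment for each grading run
    before = state.balances[victim]
    exploit(state)              # candidate exploit transaction(s)
    after = state.balances[victim]
    # Pass iff the exploit drained at least `drain_threshold` of victim funds.
    return before > 0 and (before - after) / before >= drain_threshold

# Usage: a trivially vulnerable "contract" that lets anyone sweep its funds.
def deploy() -> ChainState:
    return ChainState(balances={"victim": 100, "attacker": 0})

def exploit(state: ChainState) -> None:
    state.balances["attacker"] += state.balances["victim"]
    state.balances["victim"] = 0

print(grade_exploit(deploy, exploit, "victim"))  # True: funds fully drained
```

Because the verdict depends only on end-state balances, the grade is deterministic and requires no human judgment of the exploit's style or path.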
The Hint Experiment
When given a medium-level "mechanism hint," GPT‑5.3‑Codex's performance jumps:
- 93.9% on Patch
- 73.8% on Exploit
Pointing an agent at the broken mechanism makes patching near-automatic; localization and triage are the primary bottlenecks. These results should shape how teams build their security workflows.

How to Read the Results
Translating EVMbench scores into real-world probability requires understanding its parameters:
- Detection limits: Detection is graded against the exact flaws identified by human auditors; valid findings outside that set earn no credit.
- Structural constraints: Transactions replay sequentially on a clean local instance. Timing-dependent behaviors, MEV dynamics, and multi-chain complexities fall outside the scope.
- Lopsided reward weights: The top 10 highest-paying vulnerabilities represent ~73% of the total award mass. The benchmark is highly sensitive to specific critical bugs.
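The sensitivity to reward concentration can be made concrete with a toy calculation. The reward values below are hypothetical, chosen only so that the top 10 findings carry roughly 73% of the total award mass, matching the concentration described above:

```python
# Toy illustration of lopsided reward weights. Numbers are hypothetical,
# picked so the top 10 of 24 findings hold ~73% of total award mass.
rewards = [50, 40, 35, 30, 28, 25, 22, 20, 18, 16] + [7.5] * 14

total = sum(rewards)
top10_share = sum(rewards[:10]) / total
print(f"top-10 share of award mass: {top10_share:.0%}")

# Missing just the two highest-paying bugs costs more of the score than
# missing the ten lowest-paying ones combined:
miss_top2 = sum(rewards[:2]) / total
miss_bottom10 = sum(rewards[-10:]) / total
print(f"cost of missing top 2: {miss_top2:.0%}, bottom 10: {miss_bottom10:.0%}")
```

Under weights like these, a model's headline score swings heavily on whether it catches a handful of critical bugs, not on its hit rate across the long tail.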
The Trend Line is Accelerating
EVMbench arrives shortly after Anthropic Fellows’ SCONE benchmark (December 2025), which reported a 51.1% exploitation success rate across 405 real-world contracts.
Independent methodologies yield the same conclusion: agentic exploitation is improving rapidly. Paradigm noted that top models previously exploited under 20% of critical fund-draining bugs. Today, GPT‑5.3‑Codex exceeds 70%.
Cantina’s Take: Retool for Agent Speed
The adversary loop is compressing. Attackers require a single working path. Defenders require breadth, accuracy, and safe patches.
If you ship smart contracts, implement these steps:
- Treat triage as a first-class system: Produce minimal, reproducible problem statements (clear invariants, failing tests, precise scopes).
- Measure false positive cost: High noise burns out engineering teams. Pipelines must effectively separate signal from noise.
- Make continuous security a reality: Implement continuous scanning and active bug bounties to maintain coverage as code evolves.
- Reduce blast radius by design: Implement pausable components, strict role boundaries, rate limiting, and rehearsed incident response to contain failures.
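The blast-radius controls in the last bullet can be sketched as a small state machine. A real implementation would live on-chain (e.g., in Solidity with audited access-control libraries); this Python sketch, with illustrative names and numbers, only shows the control structure of pausing, role boundaries, and rate limiting:

```python
# Sketch of blast-radius controls: a pausable, rate-limited withdrawal
# path behind a strict role boundary. Names and numbers are illustrative;
# a production version would be an on-chain contract.

class Vault:
    def __init__(self, guardian: str, balance: int, max_per_window: int):
        self.guardian = guardian              # only this role may pause
        self.balance = balance
        self.max_per_window = max_per_window  # withdrawal cap per time window
        self.withdrawn_this_window = 0
        self.paused = False

    def pause(self, caller: str) -> None:
        if caller != self.guardian:           # strict role boundary
            raise PermissionError("only guardian may pause")
        self.paused = True

    def withdraw(self, amount: int) -> int:
        if self.paused:                       # pausable component
            raise RuntimeError("vault paused")
        if self.withdrawn_this_window + amount > self.max_per_window:
            raise RuntimeError("rate limit exceeded")  # caps drain speed
        self.withdrawn_this_window += amount
        self.balance -= amount
        return amount

vault = Vault(guardian="ops", balance=1_000, max_per_window=100)
vault.withdraw(100)   # fine: within the window's limit
vault.pause("ops")    # incident response: guardian halts all flows
```

Even if an exploit lands, the rate limit bounds how much leaves per window, and a rehearsed pause procedure turns a full drain into a partial loss.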
Where Cantina Fits
Defenders require speed, scale, and high signal. Cantina pairs AI automation with expert human judgment to validate actionable findings, supported by a massive researcher community. AI agents now move from "possible exploit" to "working exploit" inside a single loop, increasing the need for validated intelligence.
EVMbench's best agent drains funds 72.2% of the time, while detection and patching remain in the 40s. The core security challenge has shifted squarely to finding and fixing vulnerabilities at speed.
Contact us to stay ahead of the current threat landscape.