2026-02-19 07:19:26

We recently saw a $1.78M exploit caused by a vulnerability written by Claude Opus 4.6.

cbETH was priced at $1 instead of $2,000.
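To give a sense of how large that gap is, here is a minimal sketch of a generic price sanity check. It is not the affected protocol's code, and I'm not claiming this is how the bug was introduced or how it would have been caught; the function name `price_within_bounds` and the 50% tolerance are assumptions made up for illustration.

```python
def price_within_bounds(reported: float, reference: float,
                        max_deviation: float = 0.5) -> bool:
    """Return True if `reported` is within `max_deviation` (as a fraction)
    of `reference`. Hypothetical helper, for illustration only."""
    if reported <= 0 or reference <= 0:
        # Non-positive prices are never considered valid.
        return False
    return abs(reported - reference) / reference <= max_deviation


# The figures from the post: $1 reported vs. roughly $2,000 expected.
looks_sane = price_within_bounds(reported=1.0, reference=2000.0)
print(looks_sane)
```

With those two numbers the reported price is off by a factor of about 2,000, so any reasonable tolerance would flag it.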
Not long after, @OpenAI launched EVMbench. Put simply, it’s a benchmark that evaluates AI agents’ ability to interact with smart contracts.

It has 3 main evaluation modes:
> Detect: analyzes the agent’s ability to detect vulnerabilities
> Patch: analyzes the agent’s ability to fix those vulnerabilities
> Exploit: analyzes the agent’s ability to exploit those vulnerabilities
Their analysis showed that recent models (Opus 4.6, GPT-5.3-Codex, etc.) are very good at exploiting vulnerabilities, but weak at detecting and patching them.
And that’s exactly what I’ve observed while running my own agents on the latest models. In my agent team, I always include an auditor agent that gets full context, with the main objective of finding vulnerabilities.
When it finds one, the dev agent fixes it easily.
But the issue is that out of 10 vulnerabilities, it might only find 3. For now, we simply can’t rely on agents to properly detect vulnerabilities.
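That "3 out of 10" figure is just a detection recall. A quick sketch of the arithmetic, using my rough numbers above (an informal estimate, not a measured benchmark result; `detection_recall` is a name I made up for this example):

```python
def detection_recall(found: int, total: int) -> float:
    """Fraction of known vulnerabilities the auditor agent actually flagged."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    return found / total


found, total = 3, 10          # rough figures from the post
recall = detection_recall(found, total)
missed = total - found        # vulnerabilities that slip through undetected
print(f"recall: {recall:.0%}, missed: {missed}")
```

In other words, roughly 7 out of 10 issues would never even reach the dev agent that could fix them.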
Launching this benchmark is a very strong move. I’m excited to test it with my agents.
To be clear, this is not a security scanner or a production-ready audit tool. It’s mainly meant to measure AI capabilities, compare models, and provide metrics on how AI is progressing in this field.
Basically, it’s a tool that allows AI to be evaluated and improve in this domain, and tbh, we really need that.
