Smart contracts are a harsh target. They hold assets. They compose with other systems. They often run with limited upgrade paths and irreversible history. When something breaks, it breaks in public, under load, with incentives.
That is why contract security has always been about more than spotting “bad patterns.” You need to understand intent, state, permissions, and the ways external callers can turn small mistakes into real losses.
Large language models are starting to help with that work, but only if we use them in the right shape. A chat style answer is not a security tool. A security tool has to earn trust with evidence.
What LLMs are actually good at in contract review
LLMs do two things unusually well for this domain.
First, they build a quick mental model of unfamiliar code. Given enough context, they can summarize what a contract is trying to do, identify privileged roles, outline state transitions, and highlight external calls and trust boundaries. That accelerates the part of auditing that usually burns time: getting oriented.
Second, they generate useful hypotheses. A good model will notice suspicious ordering, missing checks, inconsistent assumptions across functions, and places where invariants could be violated. It can propose attack sketches, test cases, and patches that give an auditor a head start.
The key word is “hypotheses.” LLMs will sometimes sound certain when they are wrong. You cannot accept a finding because it reads well. You accept it because the tool can show a concrete execution path, or produce a failing test, or point to a violation of an invariant you can verify.
The difference between “LLM analysis” and an AI code analyzer
A serious code analyzer is not a prompt. It is a workflow that uses an LLM as one component.
In practice, that workflow has three jobs:
- Give the language model the right context, without drowning it.
- Force the model to express claims in a way that can be checked.
- Filter output until the remaining set is worth an engineer’s time.
That is what “signal over noise” means in security tooling. It is not fewer alerts for the sake of fewer alerts. It is fewer alerts because each one carries an argument you can validate.
How an AI analyzer turns code into validated findings
Most modern analyzers start by turning code into structure. Solidity is readable, but a machine can do more with an abstract syntax tree, a control flow graph, and a call graph than with raw text alone. Those representations also make it easier to slice large codebases into coherent units without losing important relationships.
A typical pipeline looks like this:
Parse and index the code
The LLM tool builds a structured representation of the contracts, then indexes it so it can retrieve the right pieces when it needs them. This is where you capture “what calls what,” “what writes to storage,” and “what is externally reachable.”
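A minimal sketch of that index, with a hypothetical call graph and per-function facts (all names invented for illustration), answering a question like "what writes storage and is reachable from outside?":

```python
from collections import deque

# Hypothetical call graph and per-function facts for a tiny contract suite.
CALLS = {"withdraw": ["_send"], "_send": [], "initialize": []}
FACTS = {
    "withdraw":   {"external": True,  "writes_storage": True},
    "_send":      {"external": False, "writes_storage": False},
    "initialize": {"external": True,  "writes_storage": True},
}

def reachable_from_external(calls, facts):
    """BFS over the call graph from every externally callable entry point."""
    frontier = deque(f for f, info in facts.items() if info["external"])
    seen = set(frontier)
    while frontier:
        fn = frontier.popleft()
        for callee in calls.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                frontier.append(callee)
    return seen

reachable = reachable_from_external(CALLS, FACTS)
risky = {f for f in reachable if FACTS[f]["writes_storage"]}
# risky holds the externally reachable storage writers
```

The point of the structure is that the model can now be asked narrow, answerable questions instead of being handed raw source.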
Generate candidate issues
The model scans the structured view and proposes potential vulnerabilities. The output should be specific, not “this looks risky.” It should name the variables and functions involved, the preconditions, and the mechanism of failure.
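The shape of such a claim can be made explicit. A hypothetical finding schema (field names invented for illustration, not a real analyzer API):

```python
from dataclasses import dataclass

@dataclass
class CandidateFinding:
    """A specific, checkable claim, not 'this looks risky'."""
    contract: str
    function: str
    variables: list       # the state the claim is about
    precondition: str     # what must hold for the issue to trigger
    mechanism: str        # how the failure actually happens

finding = CandidateFinding(
    contract="Bank",
    function="withdraw",
    variables=["balances[msg.sender]"],
    precondition="caller is a contract with a reentering fallback",
    mechanism="external call executes before the balance is decremented",
)
```

A claim in this form can be falsified; "this looks risky" cannot.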
Critique and verify
A second pass (sometimes another model, sometimes deterministic checks, sometimes both) tries to falsify the claim. This is where you cut noise. If the claim cannot survive basic scrutiny, it should not reach an engineer.
Verification can include targeted unit tests, fuzz cases, symbolic execution traces, or invariant checks. You do not need every finding to come with a full exploit. You do need it to come with something you can test.
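One way to encode that gate, sketched with hypothetical evidence fields:

```python
def gate(findings):
    """Keep only findings that carry a checkable artifact: a failing test,
    an execution trace, or a reproducible invariant violation."""
    evidence_keys = ("failing_test", "trace", "invariant_violation")
    return [f for f in findings if any(f.get(k) for k in evidence_keys)]

raw = [
    {"title": "reentrancy in withdraw", "failing_test": "drains bank in harness"},
    {"title": "this looks risky"},  # no artifact: a lead, not a finding
]
confirmed = gate(raw)
# confirmed contains only the reentrancy item
```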
Present results like an auditor would
High quality output looks like a security report, not a lint warning. It explains impact, shows a minimal reproduction path, and suggests remediation options without pretending there is only one “correct” fix.
This framing matters because it matches how teams actually work. Engineers do not patch warnings. They patch verified risks.
What these AI models can catch well today
LLM driven analyzers tend to do well on classes of issues where the code carries enough signal to reason about intent and execution order.
Reentrancy and unsafe external calls
Not every external call is a vulnerability, but external calls change the threat model. A good analyzer can identify value transfer patterns, where state updates happen, and whether an attacker can reenter through a callback path.
Access control errors
These are rarely exotic. They show up as missing checks, incorrect modifiers, unsafe role assignment, or a privileged function that is externally reachable through an unexpected path. LLMs can be effective at mapping “who is allowed to do what” across a contract suite.
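A sketch of that mapping, with hypothetical function facts, flagging privileged entry points that carry no access check:

```python
from dataclasses import dataclass

@dataclass
class Fn:
    name: str
    external: bool        # externally reachable entry point
    modifiers: frozenset  # access-control modifiers, e.g. {"onlyOwner"}
    privileged: bool      # touches admin-controlled state

def unguarded_privileged(fns):
    """Flag privileged, externally reachable functions with no access check."""
    return [f.name for f in fns if f.external and f.privileged and not f.modifiers]

fns = [
    Fn("setFeeRecipient", True, frozenset(), True),     # missing check
    Fn("pause", True, frozenset({"onlyOwner"}), True),  # guarded
    Fn("deposit", True, frozenset(), False),            # not privileged
]
# unguarded_privileged(fns) flags only setFeeRecipient
```

Real permission flows cross contracts and proxies, so the facts themselves are the hard part; the query is simple once they exist.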
Arithmetic and accounting mistakes
Since Solidity 0.8, integer overflow and underflow revert by default. That removed one large class of historical failures. The remaining risk is usually in unchecked blocks, custom math, or accounting logic that assumes a property that does not hold under edge cases.
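A toy illustration of an accounting property failing at the edges: pro-rata payouts with integer division (hypothetical numbers, not from any real contract):

```python
def distribute(total_reward: int, balances: dict) -> dict:
    """Pro-rata payout with floor division, as Solidity integer math would do."""
    supply = sum(balances.values())
    return {user: total_reward * bal // supply for user, bal in balances.items()}

payouts = distribute(100, {"a": 1, "b": 1, "c": 1})
# The "obvious" invariant sum(payouts) == total_reward does not hold:
# each holder gets 33, and one unit of dust is stranded.
```

Whether stranded dust is harmless or exploitable depends on who can sweep it and how often the rounding repeats; that is exactly the kind of assumption an analyzer should surface.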
Time and ordering dependencies
Smart contracts that rely on timestamps, ordering, or “first caller wins” logic often make assumptions about who controls scheduling. Those assumptions break under MEV, block builder incentives, and cross chain sequencing.
Upgrade and initialization footguns
Upgradeable patterns and initialization logic create a long tail of mistakes, especially around uninitialized storage, misconfigured admin roles, and upgrade paths that bypass intended checks. These are hard to catch with simple pattern matching because they depend on how multiple contracts interact.
None of this replaces deep smart contract review. It changes how quickly you can surface areas that deserve deep review.
A concrete example: reentrancy that a tool should prove
Consider a simplified “bank” style withdraw function. The pattern is familiar: send funds, then update state.
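As a stand-in for the Solidity snippet (reproduced only as an image in the original post), here is a minimal executable Python model of the same send-then-update pattern, with a receiver that reenters; all names are hypothetical:

```python
class Bank:
    """Toy Python model of the Solidity pattern: send funds, then update state."""

    def __init__(self, reserves: int):
        self.reserves = reserves      # funds belonging to other depositors
        self.balances = {}

    def deposit(self, caller, amount: int):
        self.balances[caller] = self.balances.get(caller, 0) + amount
        self.reserves += amount

    def withdraw(self, caller):
        amount = self.balances.get(caller, 0)
        if amount == 0:
            return
        self.reserves -= amount
        # BUG: the external call runs before the balance is zeroed, so a
        # malicious receiver can reenter withdraw() with a stale balance.
        caller.receive(self, amount)  # models the ETH send hitting a fallback
        self.balances[caller] = 0     # too late: reentry already happened
        # Fix sketch: zero balances[caller] before the external call
        # (checks-effects-interactions), or add a reentrancy guard.


class Attacker:
    """Receiver whose fallback reenters until the bank is drained."""

    def __init__(self):
        self.loot = 0

    def receive(self, bank, amount: int):
        self.loot += amount
        if bank.reserves >= amount:   # keep reentering while funds remain
            bank.withdraw(self)


bank = Bank(reserves=90)              # 90 units belong to other users
mallory = Attacker()
bank.deposit(mallory, 10)             # bank now holds 100 in total
bank.withdraw(mallory)
# mallory.loot == 100: she drains the whole bank, not just her own 10
```

The model is deliberately simplified (no gas, no real ETH transfer), but it reproduces the failing behavior a harness should demonstrate.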
A basic AI code scanner can flag “external call before state update.” That is useful, but incomplete. A high signal analyzer should go further and make the risk concrete:
- It should identify the state variable that protects funds, `balances[msg.sender]`.
- It should show that the balance check happens before the external call.
- It should explain that a malicious receiver can reenter `withdraw` through the fallback path before the balance is reduced.
- It should produce a minimal harness contract that deposits once, then reenters until the bank's balance is drained, or until gas runs out.
That last step is the difference between suspicion and evidence. If an LLM can generate a plausible exploit story but cannot produce a test that fails, you treat it as a lead, not a finding.
And remediation guidance should reflect intent. Sometimes the fix is “update state before the call.” Sometimes the right fix is a reentrancy guard, or pulling funds to a separate withdrawal queue, depending on design constraints. The tool should not pretend it knows the product decision. It should make the security tradeoff explicit.
Where “zero day” discovery enters the picture
The interesting shift in 2026 is not that models recognize common bugs. Good security engineers already do that. The shift is that automated systems can explore edge cases at scale.
In controlled benchmarks and simulations, frontier models have been able to generate working exploits for a large share of historically exploited contracts, and they have also surfaced previously unknown issues in fresh code under test harnesses. Simulated settings simplify reality, but they are still a signal: attackers can automate parts of exploit development that used to be manual and slow.
That changes the defender’s timeline. When exploit discovery becomes cheaper, you should assume that if a vulnerability class exists in production, somebody will search for it.
Failure modes you need to design around
If you are evaluating an AI analyzer, focus less on demo finds and more on how it handles being wrong.
Hallucinated reasoning
Generative AI models can invent an exploit path that sounds coherent but does not compile, does not execute, or depends on nonexistent state. A tool needs a verifier layer that is willing to say “no.”
Missing context
Large systems exceed context windows. If the analyzer chunks code poorly, it will miss cross contract assumptions and subtle permission flows. Retrieval and graph based context selection matter more than a longer prompt.
Over confidence in heuristics
Many bugs are not “patterns.” They are mismatches between intent and implementation. An LLM tool must surface the assumption it is making. If it cannot, it is guessing.
Dual use pressure
Any technique that helps defenders can be repurposed by attackers. That does not mean you avoid the tools. It means you deploy them with guardrails, responsible workflows, and a clear plan for handling what they surface.
How to use AI in your workflow without fooling yourself
If you want value from AI in smart contract review, treat it like a high throughput research assistant with a strict supervisor.
Run it early, while architecture is still flexible. Fixes are cheaper before you have public integrations and on-chain liquidity.
Run it continuously, on every meaningful change. A point in time audit will miss the regression introduced tomorrow.
Gate findings on proof. At minimum, require a trace, a failing test, or a clearly stated invariant violation you can reproduce.
Keep humans on the critical decisions. Severity is not an LLM output. Severity is a function of business context, asset exposure, and realistic attacker capabilities.
What we are building at Cantina
We are building an AI code analyzer with one priority: high signal findings that engineers can act on.
That means fewer alerts, tighter reasoning, and outputs that are designed to be checked. If the tool cannot justify a claim, it should not ask for your time.
Join the waitlist to test it before launch
We are running early access ahead of release. If you want to put the analyzer on real code and see what it flags, and what it deliberately ignores, join the waitlist.
If your contracts are complex, upgradeable, or heavy on edge case logic, that is exactly the kind of surface we want to test.
