Two Agents, One Argument: Adversarial Verification as a QA Layer

A thought experiment on why a model reviewing its own output shares the same blind spot that produced the output in the first place, and what changes when the review step is a genuinely adversarial second agent instead.

Two Agents, One Argument: Adversarial Verification as a QA Layer

This one's speculative — a thought experiment about design principles, not a report on something I've built.

A model reviewing its own output for mistakes is a structurally weak check, and it's worth being precise about why: it's using the same reasoning that produced the output to go looking for the flaw in that output. If there's a systematic gap in how it thinks about the problem, the self-review shares that exact gap, because it's the same process run a second time with a different instruction on top. The more interesting question is what happens if the review step is a genuinely separate agent whose entire job is to argue against the first one, not confirm it.

Self-review inherits the blind spot

Ask a model to double-check its own work and you get, most of the time, a slightly more polished version of the same answer with the same underlying assumption intact — because nothing about asking it to review its own reasoning gives it access to a different way of thinking about the problem. Two independently framed agents, even built on the same underlying model, are far more likely to actually disagree on the weak point, because they're not both starting from the same completed answer looking for reasons to keep it.

Give the critic an explicitly adversarial job

The instruction framing matters more than it should. "Please review this for correctness" tends to produce polite confirmation with a few cosmetic suggestions. "Your only job is to find a reason this is wrong or incomplete, and you gain nothing by agreeing with it" produces meaningfully different output — not because the model has an incentive in any real sense, but because the framing changes what kind of response the instruction is actually asking for. The adversarial framing has to be explicit; a generic "review this" prompt defaults to agreement more often than it should.

Where the doubled cost is worth it, and where it's theater

This pattern doubles inference cost on every checked output, so it isn't free and isn't always worth doing. It earns its cost on high-consequence, one-shot outputs where a mistake is expensive to catch downstream — a generated contract clause, a financial calculation, a piece of code that touches production data directly. It's pure theater on low-stakes, high-volume outputs where being occasionally wrong is trivially recoverable; running an adversarial pass on every routine response is waste dressed up as rigor.

The failure mode of the pattern itself

Two agents can still converge into quiet agreement if they share too much context, or if the underlying model's own blind spot shows up in both of them regardless of how differently they're instructed. Genuine adversarial value requires deliberately different framing at minimum, and ideally a genuinely different underlying model — two calls dressed up as two agents, running the same model with cosmetically different prompts, is closer to the self-review problem than it looks.

The point is manufactured disagreement

The goal isn't "a second model checking the first model's homework" in the sense of a rubber stamp with extra steps. It's deliberately manufacturing genuine disagreement, because the disagreement itself — the specific point where two independent reasoning paths land on different conclusions — is where the actually useful signal lives. Agreement between two agents tells you very little. A specific, articulable disagreement tells you exactly where to look.


I'm Jesse Myers — Marine veteran, 32 years in enterprise IT, now building production AI systems. This site is where I write about what I've actually built, and occasionally about ideas I haven't built yet but think are worth taking seriously.