Should an AI Review Its Own Generated Code?

A thought experiment on the specific failure mode of asking a model to check the correctness of code it just wrote itself, and what a genuinely independent review would need to look like instead.

This one's speculative — a thought experiment about design principles, not a report on something I've built.

Asking a model to review code it just generated has the identical structural weakness as asking it to review any of its own output: if there's a systematic gap in its understanding of the problem, that gap produced the bug and will just as easily produce a clean bill of health on review, because it's the same reasoning checking its own work.

Execution is the honest reviewer

The most reliable check on generated code isn't another model's opinion about whether it looks correct — it's actually running it against real tests and observing what happens. Test execution doesn't share the blind spot a language model's self-review does, because it doesn't reason about the code at all; it just runs it and reports what's actually true, which is a fundamentally different and more trustworthy kind of signal.

A second model helps only if it's genuinely independent

A different model, or the same model given an adversarial instruction to find flaws rather than confirm correctness, can catch things self-review misses — but only if it's actually independent in a meaningful sense, not the same model asked the same way twice with a different label attached to the response.

The practical answer is layered, not either-or

The strongest setup isn't choosing between AI review and test execution — it's using generated tests plus real execution as the primary check, with an independent adversarial review as a second layer for anything that passes tests but still touches something consequential enough to warrant a second, differently-framed look before it ships.

I'm Jesse Myers — Marine veteran, 32 years in enterprise IT, now building production AI systems. This site is where I write about what I've actually built, and occasionally about ideas I haven't built yet but think are worth taking seriously.

The same blind spot writes and reviews

Execution is the honest reviewer

A second model helps only if it's genuinely independent

The practical answer is layered, not either-or