Testing an AI System Is Not the Same as Testing Software, and Treating It Like Software QA Misses the Real Risks

A thought experiment on the specific ways traditional software testing practices fall short for a system whose core behavior is probabilistic rather than deterministic.

Testing an AI System Is Not the Same as Testing Software, and Treating It Like Software QA Misses the Real Risks

This one's speculative — a thought experiment about design principles, not a report on something I've built.

Unit tests assume a fixed answer exists

A traditional unit test checks that a specific input produces a specific, exact output, every time. That model breaks down for an AI system, where a correct answer can legitimately be phrased several different ways, and a rigid exact-match test either fails constantly on perfectly good output or gets loosened so much it stops catching real regressions at all.

Test properties of the output, not the exact text

The more durable approach is testing structural and semantic properties instead of literal text — does the output contain the required fields, does it stay within an expected length and tone, does it avoid a specific category of claim it shouldn't make — properties that remain meaningful across the natural variation in language model output, rather than breaking on every harmless rewording.

A representative test set matters more than test count

A thousand tests built from convenient, easy-to-write examples miss more real risk than fifty tests built deliberately from the actual hard cases the system needs to handle correctly — the edge cases, the ambiguous phrasings, the adversarial-sounding inputs. Test count is a vanity metric here in the same way total usage is a vanity metric for adoption; coverage of the cases that actually matter is what counts.

Production monitoring is part of the test suite now

Because a language model's behavior can shift with a silent provider-side update in a way traditional software's behavior does not, pre-launch testing alone isn't sufficient — ongoing production monitoring against the same behavioral properties the test suite checks has to be treated as a continuous, permanent extension of testing, not a separate concern that starts once testing is considered done.


I'm Jesse Myers — Marine veteran, 32 years in enterprise IT, now building production AI systems. This site is where I write about what I've actually built, and occasionally about ideas I haven't built yet but think are worth taking seriously.