The Economics of Retrying: When a Second AI Call Is Cheaper Than Getting It Right the First Time
A thought experiment on when a cheap model plus a retry loop beats one expensive model call — and where that math stops working.
This one's speculative — a thought experiment about design principles, not a report on something I've built.
A counterintuitive trade
The intuitive move when accuracy matters is to reach for a more capable, more expensive model. Often the cheaper path to the same reliability is a fast, inexpensive model plus a validation check plus a retry on failure. When each cheap call costs a small fraction of the expensive one, two or three cheap attempts frequently cost less in aggregate than one expensive attempt, while landing at comparable final accuracy.
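To make the trade concrete, here is a back-of-the-envelope sketch of the expected cost per request. It rests on assumptions I'm adding for illustration: every cheap attempt succeeds independently with the same probability, the validation check is free, and a request that fails every cheap attempt is handed to the expensive model once. The function name, parameters, and numbers are all hypothetical, and real failures tend to cluster on the same hard requests, so treat the result as an optimistic estimate.

```python
def expected_cost(p_success: float, cheap_cost: float,
                  expensive_cost: float, max_attempts: int) -> float:
    """Expected cost per request: up to max_attempts cheap tries, then one expensive call.

    Assumes independent attempts with a fixed success probability (an
    illustrative simplification, not a measured property of any model).
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    p_fail = 1.0 - p_success
    # Attempt k (0-indexed) only runs if all k earlier attempts failed.
    expected_cheap_attempts = sum(p_fail ** k for k in range(max_attempts))
    # Escalation happens only if every cheap attempt failed.
    p_escalate = p_fail ** max_attempts
    return cheap_cost * expected_cheap_attempts + expensive_cost * p_escalate


# Illustrative numbers only: cheap call = 1 unit, expensive call = 10 units,
# 80% per-attempt success, ceiling of 3 cheap attempts.
print(round(expected_cost(0.8, 1.0, 10.0, 3), 2))  # 1.32
```

With those made-up inputs the loop averages about 1.32 cost units per request against 10 for always calling the expensive model. The gap narrows quickly as the per-attempt success rate drops or the price ratio shrinks.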
This only works when failures are detectable
The entire strategy depends on being able to tell, cheaply and automatically, whether an attempt actually succeeded — a validation check, a schema constraint, an internal consistency test. If failure can only be detected by a human reading the output, the retry loop has no signal to retry on, and the cheap-model-plus-retries approach collapses back into needing the expensive model, or a person, in the loop.
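As a sketch of what "cheap and automatic" might look like, the check below validates a model's JSON output using only the standard library. The invoice shape (invoice_id, line_items, total) is a hypothetical example I'm inventing for illustration. The function combines a structural check with one internal consistency test: the line items must sum to the stated total.

```python
import json
from typing import Any, Optional


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_invoice(raw: str) -> Optional[str]:
    """Return None if the output passes, otherwise a short failure reason.

    The expected fields are a hypothetical schema, not a real API contract.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value is not an object"
    if not isinstance(data.get("invoice_id"), str):
        return "invoice_id missing or not a string"
    items = data.get("line_items")
    if not isinstance(items, list) or not items:
        return "line_items missing or empty"
    if not all(isinstance(i, dict) and _is_number(i.get("amount")) for i in items):
        return "every line item needs a numeric amount"
    if not _is_number(data.get("total")):
        return "total missing or not a number"
    # Internal consistency test: the parts must add up to the stated whole.
    if abs(sum(i["amount"] for i in items) - data["total"]) > 0.01:
        return "line items do not sum to total"
    return None
```

A check like this catches malformed or self-contradictory output, but not output that is well-formed and wrong. That gap is the boundary of what a retry loop can fix on its own.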
Retries have a tail that eats the savings
Most requests succeed on the first or second cheap attempt, and that's where the savings live. But a small fraction fail every retry, and a naive retry loop either burns unbounded cost chasing them or, worse, silently gives up. The design has to include an explicit ceiling: a fixed number of cheap attempts, then an escalation to the expensive model or a human. Without one, an unbounded loop quietly turns a cost-saving design into a cost-multiplying one on exactly the requests where it matters most.
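A minimal sketch of a retry loop with a ceiling might look like the following. Everything here is hypothetical scaffolding: cheap_call and expensive_call stand in for whatever model clients you use, and validate is assumed to return None on success or a failure reason otherwise. The point is the shape: a fixed number of cheap attempts, then a visible escalation instead of a silent give-up.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    output: str
    cheap_attempts: int   # how many cheap calls were made
    escalated: bool       # True if the expensive fallback was used


def run_with_ceiling(
    prompt: str,
    cheap_call: Callable[[str], str],          # hypothetical cheap-model client
    expensive_call: Callable[[str], str],      # hypothetical fallback (model or human queue)
    validate: Callable[[str], Optional[str]],  # returns None when the output passes
    max_cheap_attempts: int = 3,
) -> Result:
    if max_cheap_attempts < 1:
        raise ValueError("max_cheap_attempts must be at least 1")

    for attempt in range(1, max_cheap_attempts + 1):
        output = cheap_call(prompt)
        if validate(output) is None:
            return Result(output, attempt, escalated=False)

    # Ceiling reached: escalate exactly once and record that it happened.
    # Note: this sketch does not re-validate the fallback's output.
    return Result(expensive_call(prompt), max_cheap_attempts, escalated=True)
```

Returning the attempt count and the escalated flag keeps the tail measurable. If the escalation rate creeps up, the savings are eroding, and you want to see that in your metrics rather than on the bill.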
Where the math flips
For latency-sensitive requests or one-shot requests with immediate consequences (a real-time interaction, a document generated once and acted on immediately), retries cost time a user is actively waiting through. That cost doesn't show up on an API bill, but it is just as real. The cheap-plus-retry pattern is a batch and background-processing optimization first, and a real-time one only when the latency budget genuinely has room for a few extra round trips.
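One way to check whether a latency budget "has room" is to work backward from the worst case: every cheap attempt fails and the request still has to escalate. The sketch below does that arithmetic. The function name and parameters are hypothetical, and it assumes fixed per-call latencies, where real systems would want to plug in tail latencies rather than averages.

```python
import math


def max_cheap_attempts(budget_s: float, cheap_attempt_s: float,
                       escalation_s: float) -> int:
    """How many cheap attempts fit in the budget while reserving time to escalate.

    cheap_attempt_s should include the validation check's latency.
    Assumes fixed latencies per call (an illustrative simplification).
    """
    if cheap_attempt_s <= 0:
        raise ValueError("cheap_attempt_s must be positive")
    remaining = budget_s - escalation_s  # time left after reserving the fallback
    if remaining <= 0:
        return 0  # no room for retries: go straight to the expensive path
    return math.floor(remaining / cheap_attempt_s)
```

With illustrative numbers, a 2-second interactive budget, a 0.4-second cheap attempt, and a 1.5-second escalation leave room for exactly one cheap attempt, while a 60-second background job with 0.5-second attempts and a 2-second escalation leaves room for far more than you would ever want to use.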
I'm Jesse Myers — Marine veteran, 32 years in enterprise IT, now building production AI systems. This site is where I write about what I've actually built, and occasionally about ideas I haven't built yet but think are worth taking seriously.