Self-correction without infinite loops: agent stopping criteria that actually work

The Reflexion paper landed in March 2023 with a clean idea: when an agent fails a task, have it write a short verbal critique of what went wrong, and feed that critique into the next attempt. The numbers were striking: on HumanEval, Reflexion lifted GPT-4 from 80% to 91% pass@1, just by letting it try again with a self-written hint. Self-Refine, from roughly the same era, showed the same model could be generator, critic, and refiner without any extra training, with a roughly 20% absolute gain on average across its tasks.

The papers were honest about their setup: small benchmarks, bounded retry counts, and clean pass/fail signals. The production interpretation was less careful. By mid-2024, half the LangChain demos shipped a “reflection node” that re-prompted the model with “Was that good? If not, try again.” Cursor shipped a self-fix loop. Devin’s autonomy pitch leaned heavily on the idea that the agent would catch its own mistakes.

What everyone discovered, somewhere between $50 and $500 of unexpected spend, is that reflection loops have two failure modes the academic papers didn’t surface. The first is the loop that never terminates — the agent decides every output is “almost good” and keeps revising. The second is the loop that improves working code into broken code — the critic finds a flaw that isn’t there, the refiner “fixes” it, and now the tests fail.

This post is about how the teams shipping real agents stopped trusting self-correction and started engineering termination.

Why “let the model decide it’s done” doesn’t work

The seductive thing about Reflexion is that the loop is self-contained. The model writes the answer, the model judges it, the model decides whether to keep going. No external infrastructure required. This is also exactly why it breaks.

Three independent things go wrong, often at once:

The critic is the generator. When the same model that wrote the answer is asked to grade it, the failure mode is well-understood — it agrees with itself. Worse, it agrees with confident wrong things. The Self-Refine paper reported that its gains came mostly from cases where the critic and generator were prompted with very different framings. Without that, the loop is theatre.

There is no ground truth. The critic is operating on vibes — “does this look like a good answer?” — not on whether the answer is correct. The model has no idea whether the SQL query it just wrote returns the right rows. It has opinions about how the SQL is formatted.

The stopping condition is also the model. “Should I revise this again?” is asked of the same vibes-based critic. The result, in production, is one of two extremes: the model loves its first answer and refuses to improve it (the bug Self-Refine was trying to fix), or it never loves any answer and revises forever (the bug Self-Refine created).

A widely cited internal report from one of the autonomous coding vendors captures the production pattern: a single Devin task on a real bug typically runs 30 to 120 minutes, making 400 to 1,200 tool calls and 200 to 600 LLM calls, with context regularly above 100K tokens because the agent keeps re-reading the codebase. When the loop is bad, the gateway in front of it has to interrupt, because the agent will not. In one reported case, an autonomous coding agent spent $180 on a single bug-fix task and returned a pull request that didn’t compile.

Figure: the naive loop (left) has no termination signal that isn’t the model itself. The split (right) replaces “is this good?” with “do the tests pass?”, a question with a yes-or-no answer.

The pattern that ships: verifier-actor split

The teams shipping reliable self-correcting agents — Aider, Sweep, Cursor’s agentic mode, Replit’s Agent — converged on the same structural answer. Stop using the LLM as its own critic. Use deterministic code as the critic instead. The LLM proposes. The compiler, linter, type checker, test runner, or schema validator disposes.

Aider’s design is the cleanest expression of the pattern. When you ask Aider to make a change, it produces a diff, applies it, and then runs the tests. If the tests fail, the failure output is included in the next prompt and the model retries. If the tests pass, the loop ends — full stop. The model is never asked “are you happy with this?” because the question is meaningless. The question is “are the tests green?” and the test runner answers it.

The same pattern, in slightly different clothing, shows up in every production coding agent:

Sweep’s loop uses the project’s linter and test suite as the gate. Each step proposes a code change, runs the verifier, and either commits or retries with the verifier output included in context.
Cursor’s agentic mode uses TypeScript errors and test output the same way. When you ask Composer to add a feature, it loops until the type checker is happy, then stops.
Replit’s Agent uses its snapshot engine — every step writes to a checkpointed filesystem, runs the app’s tests in the sandbox, and either advances or rolls back. The verifier is “did the build pass?” The actor is the model.

The structural property all of these share: the gate that decides whether to loop is not the LLM. It is code. The LLM is allowed to retry; the LLM is not allowed to declare victory.
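The shape of that gate can be sketched in a few lines. This is a minimal illustration, not the implementation used by Aider, Sweep, Cursor, or Replit: `fix_until_green`, `run_verifier`, and the `propose` callback are hypothetical names, and the verifier is assumed to be any command whose exit code is zero on success.

```python
import subprocess
from typing import Callable, Optional


def run_verifier(cmd: list[str], timeout_s: int = 120) -> tuple[bool, str]:
    """Run a deterministic check; return (passed, combined output)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"verifier timed out after {timeout_s}s"
    return proc.returncode == 0, proc.stdout + proc.stderr


def fix_until_green(
    propose: Callable[[Optional[str]], None],
    verifier_cmd: list[str],
    max_attempts: int = 5,
) -> bool:
    """Let the actor retry, but let only the verifier declare success."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        propose(feedback)                      # actor edits the workspace (LLM call lives here)
        passed, output = run_verifier(verifier_cmd)
        if passed:
            return True                        # the exit code ends the loop, not the model
        feedback = output                      # failure text goes into the next prompt
    return False                               # attempts exhausted; caller decides what to surface
```

A caller might pass something like `["pytest", "-q"]` as `verifier_cmd`. The point of the design is that `propose` has no return value the loop inspects: the actor cannot signal "done".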

What “verifier” actually means outside coding

Coding is the easy case because the verifier is sitting right there in the form of a compiler. The pattern still works in other domains; the challenge is finding a verifier you trust.

In retrieval-augmented Q&A, the verifier is usually a separate “groundedness” check: a small model or rule-based system that confirms every claim in the answer cites at least one retrieved passage and that the passage actually supports the claim. The RAGAS family of metrics captured this pattern. When the verifier fails, the agent retries with a broader retrieval; when it passes, the loop ends.
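A rule-based version of such a check might look like the sketch below. It assumes answers cite passages with bracketed markers such as `[1]`, which is a convention chosen for this example rather than anything RAGAS prescribes, and it takes the "does the passage support the claim" test as a pluggable function. The `word_overlap_supports` helper is a deliberately crude stand-in for an entailment model.

```python
import re
from typing import Callable

CITATION = re.compile(r"\[(\d+)\]")


def ungrounded_claims(
    answer: str,
    passages: dict[int, str],
    supports: Callable[[str, str], bool],
) -> list[str]:
    """Return the sentences that fail the groundedness check (empty list = pass)."""
    failures = []
    # Naive sentence split; good enough for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = [int(n) for n in CITATION.findall(sentence)]
        claim = CITATION.sub("", sentence).strip()
        # A claim passes only if some cited passage exists AND supports it.
        ok = any(n in passages and supports(passages[n], claim) for n in cited)
        if not ok:
            failures.append(sentence)
    return failures


def word_overlap_supports(passage: str, claim: str, min_overlap: float = 0.5) -> bool:
    """Crude stand-in for an entailment model: share of claim words found in the passage."""
    words = set(re.findall(r"\w+", claim.lower()))
    if not words:
        return False
    found = words & set(re.findall(r"\w+", passage.lower()))
    return len(found) / len(words) >= min_overlap
```

The returned list doubles as retry feedback: the failing sentences tell the agent which claims need broader retrieval.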

In data extraction and structured output, the verifier is a schema check plus a sanity check on the values. Did the model produce valid JSON? Are the dates in the future when they should be? Is the total equal to the sum of the line items? Failed schema, failed sanity check, retry.
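As a concrete sketch, here is what that two-layer check could look like for an invoice. The field names (`due_date`, `total`, `line_items`, `amount`) are invented for the example; a real pipeline would substitute its own schema and business rules.

```python
import json
from datetime import date


def check_invoice(raw: str, today: date) -> list[str]:
    """Return a list of problems; an empty list means the extraction passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be an object"]

    problems = []
    # Schema check: required fields and their types.
    required = {"due_date": str, "total": (int, float), "line_items": list}
    for field, kind in required.items():
        if not isinstance(data.get(field), kind):
            problems.append(f"missing or mistyped field: {field}")
    if problems:
        return problems

    # Sanity check: the due date must parse and lie in the future.
    try:
        if date.fromisoformat(data["due_date"]) <= today:
            problems.append("due_date is not in the future")
    except ValueError:
        problems.append("due_date is not an ISO date")

    # Sanity check: the total must equal the sum of the line items.
    items = data["line_items"]
    amounts = [item.get("amount") for item in items if isinstance(item, dict)]
    if len(amounts) != len(items) or not all(isinstance(a, (int, float)) for a in amounts):
        problems.append("line_items must be objects with a numeric amount")
    elif abs(sum(amounts) - data["total"]) > 0.005:
        problems.append("total does not equal the sum of line items")
    return problems
```

Returning the problems as text, rather than a bare boolean, gives the retry prompt something specific to work with.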

In tool-use agents, the verifier is whether the tool call returned a success status and whether subsequent tool calls confirm the world state matches expectations. Anthropic’s harness design guidance calls this the “ground truth via tools” pattern: the agent should not be trusted to know if its action succeeded; a follow-up tool call should confirm it.
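The confirm-by-reading-back idea can be sketched generically. Everything here is hypothetical: `act_and_confirm` is a name made up for this example, and the in-memory `store` stands in for whatever system the real tool mutates.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def act_and_confirm(
    action: Callable[[], bool],
    observe: Callable[[], T],
    expected: T,
) -> tuple[bool, str]:
    """Run a tool call, then confirm the result with an independent read."""
    if not action():
        return False, "tool reported failure"
    actual = observe()                         # follow-up call: read the world back
    if actual != expected:
        return False, f"tool reported success but state is {actual!r}, expected {expected!r}"
    return True, "confirmed"


# Example with a hypothetical in-memory key-value "tool".
store: dict[str, str] = {}


def set_flag() -> bool:
    store["feature_x"] = "on"
    return True


ok, detail = act_and_confirm(set_flag, lambda: store.get("feature_x"), "on")
```

The success status from `action` is treated as necessary but not sufficient; only the read-back decides.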

Where the pattern struggles is open-ended generation — write me a poem, summarise this conversation, draft an email. There is no compiler for prose. The teams that ship anyway either give up on auto-iteration (one shot, no critic) or fall back to weak verifiers like length bounds, keyword presence, or — increasingly — a separate cheap model fine-tuned as a quality classifier. The fine-tuned classifier is closer to a verifier than to a critic, because it produces a scalar score with a calibrated threshold, not a free-text “could be better.”
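What makes the threshold “calibrated” is that it is chosen offline against labelled examples rather than picked by feel. A minimal sketch of one way to do that, assuming you have a set of classifier scores with human good/bad labels and care most about capping the false-positive rate (the function name and the 5% default are illustrative):

```python
def calibrate_threshold(
    scored: list[tuple[float, bool]],
    max_false_positive_rate: float = 0.05,
) -> tuple[float, float]:
    """Pick the lowest threshold whose false-positive rate is within budget.

    `scored` holds (verifier_score, human_label_is_good) pairs from an offline eval.
    Returns (threshold, false_negative_rate at that threshold).
    """
    bad_scores = [s for s, is_good in scored if not is_good]
    good_scores = [s for s, is_good in scored if is_good]
    if not bad_scores or not good_scores:
        raise ValueError("need both good and bad labelled examples")

    # Try each observed score as a candidate threshold, lowest first.
    for threshold in sorted({s for s, _ in scored}):
        # False positive: a bad output that the threshold would accept.
        fpr = sum(s >= threshold for s in bad_scores) / len(bad_scores)
        if fpr <= max_false_positive_rate:
            # False negative: a good output that the threshold would reject.
            fnr = sum(s < threshold for s in good_scores) / len(good_scores)
            return threshold, fnr
    raise ValueError("no threshold meets the false-positive budget")
```

At run time the gate is then a plain comparison, `score >= threshold`, and the returned false-negative rate tells you how often a good output will be sent back for a needless retry.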

Stopping criteria that actually terminate

A verifier solves the “what counts as done” problem. It does not solve the “what if the verifier never passes” problem. For that, you need explicit termination conditions that fire regardless of the verifier state. The four that show up in every shipped agent:

1. A hard step budget. The Anthropic Claude SDK defaults to 20 steps, and most production agents pick a number between 10 and 50 depending on the task. The budget is non-negotiable. When it’s hit, the agent surfaces the best-so-far output (or a “failed” status) and stops. No further LLM calls.

2. A no-progress detector. If the verifier output hasn’t changed in N iterations — same test failing, same lint error, same JSON parse mismatch — the loop terminates. The model has demonstrated it doesn’t know how to fix this particular failure; another iteration just spends tokens. Aider implements this; Devin’s gateway implements a coarser version of this.

3. A token / cost ceiling. The loop carries a budget that decrements with every LLM call, and when the budget hits zero the loop terminates even if the verifier hasn’t passed and the step counter hasn’t tripped. This is the only safeguard against an agent that thinks it’s making progress (so the no-progress detector doesn’t fire) but is actually just churning expensive tokens.

4. A confidence threshold on the verifier itself. When the verifier is a model (a fine-tuned classifier, a small LLM-as-judge), it returns a score. The loop terminates as soon as the score crosses a calibrated threshold — and “calibrated” means you ran an offline eval and chose the threshold where false-positive and false-negative rates are acceptable. Not “the model said it’s good.”

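  # Assumes actor() returns an object with a .tokens count, and that
  # no_progress() and best_so_far() are helpers defined elsewhere.
  def run_agent(actor, verifier, max_steps, token_budget):
    step, tokens_used, state = 0, 0, []
    while True:
      step += 1
      output = actor(state)                    # one attempt by the LLM
      tokens_used += output.tokens
      verdict = verifier(output)               # deterministic or scored
      state.append((output, verdict))

      if verdict.passed:                       # success: verifier green (criterion 4 when scored)
        return output
      if step >= max_steps:                    # criterion 1: step budget
        return best_so_far(state, "step_budget_exceeded")
      if no_progress(state, window=3):         # criterion 2: no-progress
        return best_so_far(state, "stalled")
      if tokens_used >= token_budget:          # criterion 3: cost ceiling
        return best_so_far(state, "budget_exhausted")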

The thing to notice about the four conditions: none of them asks the actor whether to stop. They are all observable, deterministic properties of the loop’s state; even a model-based verifier enters the loop only as a score compared against a threshold fixed offline. This is the load-bearing design choice. The moment any of your termination conditions is “ask the LLM if we should stop,” the loop is back to vibes.
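To make “deterministic property of the loop’s state” concrete, here is one way a no-progress check might be written. It is a sketch under assumptions: the loop is assumed to keep the raw verifier output text from each attempt, and the masking rules (memory addresses, durations) are examples of noise you would tune to your own test runner. It is not how Aider or any vendor gateway implements the check.

```python
import hashlib
import re


def fingerprint(verifier_output: str) -> str:
    """Hash verifier output after masking parts that vary between identical failures."""
    text = re.sub(r"0x[0-9a-fA-F]+", "<addr>", verifier_output)    # memory addresses
    text = re.sub(r"\d+(\.\d+)?\s*(ms|s)\b", "<time>", text)        # durations like 0.12s
    text = re.sub(r"\s+", " ", text).strip()                        # whitespace noise
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def no_progress(verifier_outputs: list[str], window: int = 3) -> bool:
    """True when the last `window` attempts all produced the same failure."""
    if len(verifier_outputs) < window:
        return False
    recent = {fingerprint(out) for out in verifier_outputs[-window:]}
    return len(recent) == 1
```

Without the masking step, a changing timestamp would make every failure look new and the detector would never fire. The opposite risk is masking too aggressively and declaring a stall while the model is in fact working through distinct errors.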

Reflexion, properly used

None of this means reflection itself is wrong. It means reflection without a separate verifier is wrong. The Reflexion paper’s actual formulation always included an external evaluator — the paper used unit tests for HumanEval, environment rewards for ALFWorld, and exact-match for HotpotQA. The self-critique was layered on top of a ground-truth signal, not in place of it.

The production version of Reflexion looks like this: when the verifier fails, you don’t just retry. You ask the model to write a short post-mortem of why the verifier failed, what specifically went wrong, and what to try differently. That post-mortem goes into context for the next attempt — exactly as the paper described. The win is not that the model judges its own work. The win is that the model uses the verifier’s output to plan a better second attempt.

Aider does this. When tests fail, Aider includes both the failing test output and a model-generated short hypothesis (“the test failed because the function returns a string but the test expects an int; let’s change the return type”). Sweep does the same thing with linter and type errors. The reflection step is bounded, structured, and feeds into a deterministic gate — not a self-referential one.
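A bounded, structured reflection step could be sketched as follows. The class and function names are hypothetical, the prompt wording is only a placeholder, and the limits (three notes, 400 characters each) are arbitrary choices for illustration. The model call that writes each post-mortem is left out; only the plumbing around it is shown.

```python
from collections import deque


class ReflectionMemory:
    """Bounded episodic memory: keeps only the most recent post-mortems."""

    def __init__(self, max_items: int = 3, max_chars: int = 400) -> None:
        self._items = deque(maxlen=max_items)   # oldest notes fall off automatically
        self._max_chars = max_chars

    def add(self, reflection: str) -> None:
        # Truncate so a rambling post-mortem cannot crowd out the task itself.
        self._items.append(reflection.strip()[: self._max_chars])

    def render(self) -> str:
        return "\n".join(f"- {item}" for item in self._items)


def build_retry_prompt(task: str, verifier_output: str, memory: ReflectionMemory) -> str:
    """Assemble the next attempt's prompt from the task, the failure, and past hypotheses."""
    parts = [
        f"Task:\n{task}",
        f"Verifier output from the last attempt:\n{verifier_output}",
    ]
    notes = memory.render()
    if notes:
        parts.append(f"Hypotheses from earlier attempts:\n{notes}")
    parts.append("Propose a revised solution.")
    return "\n\n".join(parts)
```

Nothing in this code decides whether another attempt happens. The reflection only shapes the prompt; the verifier and the termination conditions still own the loop.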

Figure: Reflexion done right. The reflection step is downstream of a real verifier, not in place of one. The episodic memory holds the hint, but the loop is gated by deterministic code.

What to take away

The pattern that survived contact with production is narrow and almost boring: don’t let the model decide when it’s done. Three concrete practices follow from it.

Build the verifier before the loop. If you don’t have a check that reliably distinguishes a correct output from an incorrect one, you don’t have a self-correcting agent — you have a self-deceiving one. For code, the verifier is the test suite plus the type checker. For structured data, it’s the schema plus business-rule checks. For open prose, it’s probably nothing, and that’s a signal not to add a reflection loop in the first place.
Make every termination condition observable. Step count, no-progress detector, token budget, verifier score with a calibrated threshold. Not “model says done.” If you can’t write the stopping condition as a pure function over the loop’s state, you don’t have a stopping condition.
Use reflection as input to the actor, not as the gate. The self-critique is useful when it feeds the next attempt with a concrete hypothesis (“the test failed because of X, try Y”). It is dangerous when it decides whether the next attempt happens at all.

The hard lesson of the last two years of agent deployment is that “self-correcting” is a property you have to engineer, not a property that emerges from giving the model a critic prompt. The teams who internalised this ship agents that converge. The teams that did not internalise it ship agents that burn $180 to produce a pull request that doesn’t compile.

Further reading: the Reflexion paper and Self-Refine are the foundational references, and rereading them through a production lens is illuminating: Reflexion’s experiments paired the self-critique with an evaluator signal, while Self-Refine relied on the model’s own feedback. Anthropic’s harness design notes are the best modern guide to bounding agent loops. For a concrete worked example, Aider’s scripted self-fix loop is short, readable, and a good template.