Getting tool calling to 99% reliability in production
Most production agent failures are not model failures. They are tool-call failures — the model picked the right tool but wrote the wrong arguments, or hallucinated an argument that doesn't exist. The fix is unglamorous and mostly about schema design.
If you have shipped an agent and watched its production logs for a week, you already know that the model itself is rarely the thing that breaks. The model is annoyingly good. What breaks is the bridge between the model and your code: the tool call. The argument that was supposed to be one of three enum values is a freeform sentence. The integer parameter is a string. The optional field that the schema marked optional is missing in 12% of calls and the downstream API throws a 500.
The frustrating part is that none of these are mysteries. After enough production hours, the failure modes are countably few and each has a specific fix. The teams I see hitting 99%+ tool-call reliability are not using better models. They are following a small set of unglamorous schema-design rules that compound into a system that simply works.
This post is about those rules. It’s structured around the five failure modes I see in production logs and what to do about each.
Why tool calling is the bottleneck
A useful frame, before the failure modes. The model’s job in a tool call is to translate the user’s intent into a structured invocation of a function you wrote. There are three places where that can go wrong:
The good news, looking at this, is that the part that can be fixed with schema design (step 3) is also the part that fails most often. The fix is structural, not statistical — you do not need to “train the model to be better.” You need to give it less rope to hang itself with.
Failure mode 1 — Free-form JSON where an enum should be
This is the most common production failure and the simplest to fix. Your schema looks something like:
{
  "name": "set_priority",
  "parameters": {
    "type": "object",
    "properties": {
      "priority": { "type": "string", "description": "The priority level" }
    }
  }
}
The model will write "priority": "high" most of the time, and "P1",
"urgent", "top", "critical", or "the highest one" the rest of
the time. Each of those is a parse error downstream.
The fix is to use enum everywhere a closed set is possible, and to
mean it:
{
  "priority": {
    "type": "string",
    "enum": ["low", "medium", "high"],
    "description": "Priority level. Use 'high' only when explicitly indicated by the user."
  }
}
Both Anthropic’s and OpenAI’s tool-use docs make this point, but it deserves repeating: the model treats your JSON schema as a hard constraint when you give it one, and treats your description as a suggestion when you don’t. Anywhere the universe of valid values is finite, write the enum. This includes statuses, regions, environments, languages, time units, currencies, and the categorical dimensions of any business object you manipulate.
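The same principle extends to integers (minimum, maximum), strings
(pattern for regex constraints), and arrays (minItems, maxItems).
The model generally follows these, but unless your provider enforces the
schema during decoding they are not guarantees, so still validate each
call before executing it.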
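As a rough illustration of how these constraints become mechanically checkable, here is a minimal stdlib-only sketch. The `escalation_level` field, its bounds, and the `check_args` helper are hypothetical, and the hand-rolled checker only stands in for a real JSON Schema validator; it covers just the handful of keywords used here.

```python
from typing import Any

# Hypothetical tool schema: "escalation_level" and its bounds are illustrative only.
SET_PRIORITY_SCHEMA = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "escalation_level": {"type": "integer", "minimum": 0, "maximum": 3},
    },
    "required": ["priority"],
}


def check_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return human-readable violations; an empty list means the call is acceptable."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument '{name}'")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown argument '{name}'")
            continue
        kind = spec.get("type")
        if kind == "string" and not isinstance(value, str):
            errors.append(f"argument '{name}' must be a string, got {type(value).__name__}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if kind == "integer" and (not isinstance(value, int) or isinstance(value, bool)):
            errors.append(f"argument '{name}' must be an integer, got {type(value).__name__}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"argument '{name}' was {value!r} which is not in the allowed enum {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"argument '{name}' must be >= {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"argument '{name}' must be <= {spec['maximum']}")
    return errors


print(check_args(SET_PRIORITY_SCHEMA, {"priority": "urgent"}))
```

In a real system a maintained validator would replace `check_args`; the point is only that every constraint in the schema turns into a check that either passes or yields a specific message.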
Failure mode 2 — Descriptions written for the model, not for humans
The second most common failure is descriptions like
"The priority to set (low, medium, or high)" — written to the model,
as if it were a query language. Two problems with this:
- The information should be in the schema, not the prose. If the valid values are low | medium | high, the enum should say so. The description is wasted real estate when it repeats schema content.
- The description should explain when and why to use the tool, not what arguments it takes. That is what the model needs from the prose part.
A useful test, from the Anthropic tool use guide: imagine a new hire onboarding to your codebase. Would your tool description help them understand when to call this function and what the function’s side effects are? If not, rewrite it.
A bad description:
"set_priority(item_id, priority)"
A description written for the model:
"Sets the priority of an item. item_id is the ID, priority is low/medium/high."
A description that actually works:
"Set the priority on a support ticket. Use when the user explicitly asks to change priority or escalate. Do NOT call this for new ticket creation — use create_ticket instead. Priority changes are visible to the customer immediately."
The last version tells the model when to reach for the tool, when not to, and what the consequences are. That is what improves reliability, not adding more parameter docs.
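One way to catch descriptions that merely restate the schema is a small lint over your tool definitions. The sketch below is a crude heuristic of my own, not something from either vendor's documentation: it flags a parameter description when every enum value also appears in the prose. The function name is hypothetical.

```python
import re


def description_repeats_enum(description: str, enum_values: list[str]) -> bool:
    """True when every enum value also shows up as a word in the description."""
    words = set(re.findall(r"[a-z0-9_]+", description.lower()))
    return bool(enum_values) and all(v.lower() in words for v in enum_values)


PRIORITIES = ["low", "medium", "high"]

# Restates the enum: flagged.
repeats = "The priority to set (low, medium, or high)"
# Explains when to use a value instead of listing all of them: not flagged.
guides = "Priority level. Use 'high' only when explicitly indicated by the user."

print(description_repeats_enum(repeats, PRIORITIES))
print(description_repeats_enum(guides, PRIORITIES))
```

A check like this cannot tell whether a description is good, only that it is spending words on something the enum already says.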
Failure mode 3 — Ambiguous tool selection
This is the failure where you have ten tools defined, and the model
calls search_docs when it should have called search_tickets. The
diagnostic question is always the same: are the tool names and
descriptions disambiguating?
In a real codebase I audited, an agent had three tools:
- find_user(query)
- lookup_customer(query)
- search_account(query)
These three were defined by different engineers in different sprints,
all backed by the same database with slightly different filters. The
agent picked one essentially at random. The fix was not a better
model. The fix was deleting two of them and renaming the survivor to
find_user_by_email_or_id with a description that explained exactly
what it returned and when to use it.
The general principle: the tool surface area should look like an API designed for the agent, not the union of every helper function your team has ever written. If two tools could plausibly answer the same question, you have one tool too many or one tool not specific enough.
A useful rule of thumb: if you cannot, looking only at the tool names and one-line descriptions, predict with high confidence which tool the model will pick for a given query, neither can the model.
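The rule of thumb can be roughly approximated in code. The sketch below flags pairs of tools whose one-line descriptions share most of their vocabulary, using Jaccard overlap on word sets. The descriptions and the 0.5 threshold are assumptions for illustration, and word overlap is only a cheap proxy for "could plausibly answer the same question".

```python
import re
from itertools import combinations

# Hypothetical one-line descriptions; the audited codebase's real ones are not known.
TOOLS = {
    "find_user": "Find a user by name or email",
    "lookup_customer": "Find a customer by name or email",
    "search_docs": "Search the product documentation for a phrase",
}


def tokens(text: str) -> set[str]:
    """Lowercased word set of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlapping_tools(tools: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Return (tool_a, tool_b, score) for pairs whose Jaccard overlap meets the threshold."""
    flagged = []
    for (a, desc_a), (b, desc_b) in combinations(tools.items(), 2):
        ta, tb = tokens(desc_a), tokens(desc_b)
        union = ta | tb
        score = len(ta & tb) / len(union) if union else 0.0
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return flagged


print(overlapping_tools(TOOLS))
```

A flagged pair is a prompt for a human review, not a verdict; two tools can share wording and still be distinct, and two overlapping tools can be described in entirely different words.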
Failure mode 4 — Silent schema drift
This is the failure that catches teams off-guard months into a
production deployment. The original schema said
"item_id": { "type": "string" }. Six months later, the backend was
refactored to use UUIDs, and the validation logic now requires a
specific UUID format. The schema was never updated. The model still
sends arbitrary strings. Calls fail in production at low but
non-zero rates and nobody knows why.
The structural fix is to make the schema the source of truth and
generate the validation logic from it. JSON Schema validators
(ajv for Node, jsonschema for Python, Pydantic for typed Python)
are mature and fast. If your schema says UUID, the validator
enforces it. If your backend’s expectations change, the schema
changes first and the model immediately starts producing the new
shape.
In practice, this means three habits:
- A single source for each tool’s schema. Not “the OpenAPI doc says one thing, the agent definition says another.” Generate both from the same file.
- Validate every tool call before executing it. Run the model-produced JSON through a JSON Schema validator. On validation failure, return a structured error to the model (“argument item_id must match pattern [uuid]”) and let it retry once before giving up.
- Test the schema against a small golden set. A dozen representative user queries, mapped to expected tool calls. Run this in CI. When you change a schema, the golden set catches the breakage before production does.
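The validate-before-execute habit might look something like the following minimal sketch. It uses only the standard library's `uuid` module rather than a full JSON Schema validator, and the function names and error shape are hypothetical.

```python
import uuid
from typing import Callable, Optional


def validate_item_id(args: dict) -> Optional[dict]:
    """Return None when item_id is a valid UUID, else a structured error for the model."""
    value = args.get("item_id")
    error = {
        "error": "invalid_argument",
        "argument": "item_id",
        "message": f"argument item_id was {value!r} but must be a UUID string",
    }
    if not isinstance(value, str):
        return error
    try:
        # Note: uuid.UUID is lenient about formatting (braces, missing hyphens).
        uuid.UUID(value)
    except ValueError:
        return error
    return None


def execute_tool_call(args: dict, handler: Callable[[dict], object]) -> dict:
    """Run the handler only if validation passes; otherwise hand the error back."""
    error = validate_item_id(args)
    if error is not None:
        return {"ok": False, **error}
    return {"ok": True, "result": handler(args)}


print(execute_tool_call({"item_id": "ticket-42"}, lambda a: "updated"))
```

The important property is that the backend handler never sees an argument the schema would have rejected, and the model gets back a message that names the argument and the rule it broke.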
The teams that hit 99% are not heroic firefighters — they have turned tool-call reliability into a property the test suite enforces.
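A golden-set check of this kind can be very small. In the sketch below, the two cases, the mapping of “make this urgent” to `high`, and the `call_model` callable are all assumptions; in CI, `call_model` would wrap whatever client actually sends the query and tool schemas to the model and returns the tool call it produced.

```python
from typing import Callable

# Hypothetical golden set: user query -> the tool call we expect the model to produce.
GOLDEN_SET = [
    ("make this urgent", {"name": "set_priority", "arguments": {"priority": "high"}}),
    ("this one can wait", {"name": "set_priority", "arguments": {"priority": "low"}}),
]


def run_golden_set(call_model: Callable[[str], dict], cases=GOLDEN_SET) -> list[str]:
    """Return one message per case where the produced call differs from the expected one."""
    failures = []
    for query, expected in cases:
        actual = call_model(query)
        if actual != expected:
            failures.append(f"{query!r}: expected {expected}, got {actual}")
    return failures


def stub_model(query: str) -> dict:
    """Stand-in for a real model call, so the sketch runs without network access."""
    priority = "high" if "urgent" in query else "low"
    return {"name": "set_priority", "arguments": {"priority": priority}}


print(run_golden_set(stub_model))
```

A CI job would fail the build whenever the returned list is non-empty, which is what turns a schema change into a visible test failure instead of a production incident.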
Failure mode 5 — The over-eager retry loop
The opposite failure mode, and surprisingly common: the agent encounters one bad tool call, retries it, gets another bad result, retries again, and ends up in a self-reinforcing spiral that burns $8 of tokens before timing out.
The pattern that works in production:
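A minimal sketch of one such bounded loop, under stated assumptions: `ask_model`, `validate`, and the result shape are hypothetical names rather than any particular framework's API, and the only check performed is the priority enum from earlier.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # the original call plus two corrective retries
ALLOWED_PRIORITIES = ["low", "medium", "high"]


def validate(args: dict) -> Optional[str]:
    """Return None when args are valid, else a specific, model-readable error message."""
    value = args.get("priority")
    if value not in ALLOWED_PRIORITIES:
        allowed = ", ".join(ALLOWED_PRIORITIES)
        return f"argument 'priority' was {value!r} which is not in the allowed enum [{allowed}]"
    return None


def call_with_retries(ask_model: Callable[[Optional[str]], dict]) -> dict:
    """ask_model gets the previous error (None on the first attempt) and returns tool arguments."""
    error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        args = ask_model(error)
        error = validate(args)
        if error is None:
            return {"ok": True, "args": args, "attempts": attempt}
    # Still failing after the cap: stop and surface the last error instead of looping.
    return {"ok": False, "error": error, "attempts": MAX_ATTEMPTS}


def fake_model(error: Optional[str]) -> dict:
    """Stand-in model: wrong on the first try, corrected once it sees a specific error."""
    return {"priority": "high"} if error else {"priority": "urgent"}


print(call_with_retries(fake_model))
```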
The two non-obvious moves are:
- Cap retries at two, not infinity. Three attempts means the model gets the original chance plus two corrective tries. Beyond that, the failure is not transient — it’s structural, and burning more tokens won’t help.
- The error message must be specific. Telling the model “that failed, try again” produces an identical retry. Telling it "argument 'priority' was 'urgent' which is not in the allowed enum [low, medium, high]" produces a corrected retry with very high probability. The error is half the prompt.
What 99% actually buys you
The reason this matters in production is not the failed calls themselves — those are recoverable. The reason is the downstream amplification. An agent that successfully completes 95% of tool calls completes 0.95^5 = 77% of five-step plans. At 99% per call, the same plan completes 95% of the time. The compounding is what makes the difference between “this agent feels reliable” and “this agent is unusable for any task with more than two steps.”
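The arithmetic is easy to reproduce. This small sketch assumes each call fails independently, which is a simplification, and the function name is my own.

```python
def plan_success_rate(per_call: float, steps: int) -> float:
    """Probability that every call in a plan succeeds, assuming independent failures."""
    return per_call ** steps


for rate in (0.95, 0.99):
    print(f"{rate:.0%} per call -> {plan_success_rate(rate, 5):.0%} for a five-step plan")
```

Changing the `steps` argument shows how quickly the gap widens for longer plans.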
The teams getting there in 2026 share a few cultural habits:
- One engineer owns the tool schema as a deliverable. Not “the team writes tool definitions as they go.” A single owner who can veto a tool addition the way an API team vetoes new endpoints.
- The schema is reviewed before the implementation. Tool design is treated like API design. The schema is the contract; the implementation has to match.
- Production tool calls are logged with their full input and output. Sampled, anonymised, and reviewed weekly. The first time a new failure mode appears, somebody notices because they are looking.
- Eval set covers the boring cases. Not “can the agent handle a user asking it to plan a wedding while juggling chainsaws.” Can it correctly call set_priority when the user says “make this urgent.” The mundane cases are the ones that matter for reliability.
What to take away
- Most production agent failures are tool-call failures, and most tool-call failures are schema-design problems. The fix is not a better model; it is a tighter schema.
- The five failure modes are countably few: free-form where an enum belongs, descriptions written to the model instead of for humans, ambiguous tool selection, silent schema drift, and the over-eager retry loop. Each has a structural fix.
- Treat the tool surface like an API. One owner, schema-first design, validation in CI, logs reviewed weekly. The boring discipline is what gets you from 95% to 99%.
- 99% per call is what makes multi-step agents feel real. It is worth the unglamorous engineering work to get there.
If you take one thing from this post: read your tool definitions out loud, as if a new hire were going to call those functions without any other context. If anything is unclear, the model will also find it unclear. The cure is exactly the cure for any badly documented API — better docs, tighter types, fewer overlapping endpoints.
Further reading: Anthropic’s Tool use overview and Tool use best practices are the cleanest official references. OpenAI’s function calling guide covers the same ground from the OpenAI API angle. For the schema-validation tooling, see Pydantic for Python and Ajv for Node.