Inference routing: sending each query to the cheapest model that can answer it
By 2026 the biggest lever on inference cost isn't quantisation or batching — it's deciding which model touches each query. Four routing patterns, three rounds of vendor consolidation, and a real case study where a customer support agent cut spend 80% with no measurable quality drop.
The unspoken truth of every “we cut LLM costs by 90%” post you’ve read is that the team didn’t really negotiate down their per-token price. They changed which model was answering each query. That’s it. That’s the trick. In most production workloads only a small fraction of queries genuinely need the smartest model, so the routing decision (which model gets which query) turns out to be the single highest-leverage point in the entire inference stack.
This used to be folklore: experienced teams routed by gut feel, inexperienced teams paid GPT-4 prices for “what’s 2+2.” By 2026 it’s a real engineering discipline with four named patterns, a handful of production-grade vendors, and benchmark data that says the same thing every time — most queries are easy, smart routing captures the savings, quality barely moves. This post is the working version of that discipline.
Why the routing decision dominates everything else
A worked example, since the math is the punchline. A customer support agent handling 10M queries a month. Three obvious options:
- All Claude Sonnet at ~$3 input / $15 output per million tokens. Assume 1.5K input / 500 output tokens per query: $0.0045 + $0.0075 = $0.012 per query, or ~$12 per 1K queries. Total: ~$120K/month.
- All Claude Opus at ~$15 / $75 per million tokens. ~$60 per 1K queries. ~$600K/month.
- Smart routing: 70% to Haiku ($1 / $5, ~$4 per 1K queries), 20% to Sonnet, 10% to Opus. Weighted cost: 0.7 × $4 + 0.2 × $12 + 0.1 × $60 = ~$11.20 per 1K queries. ~$112K/month. About 81% lower than all-Opus. Against all-Sonnet the saving is only about 7%, because the 10% Opus slice alone accounts for more than half of the routed bill.
The same math holds with open-weight models. Routing 70% of traffic from Llama-3.3-70B ($0.88-0.90/M on Together, Fireworks) to Llama-3.1-8B ($0.18/M on Together) drops the blended rate from $0.88/M to roughly $0.39/M, a cut of about 56%, because the traffic-weighted average price falls. The model mix across all queries matters more than the model choice on any single query.
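The arithmetic in this section is easy to re-run for your own mix. The sketch below is one way to do it, using the list prices and the 1.5K-input / 500-output token assumption from the example; the function names are made up for illustration, and real bills will also depend on caching discounts and per-tier token counts that this ignores.

```python
# Prices are (input $, output $) per million tokens, as quoted in the example above.
PRICES = {
    "haiku": (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}


def cost_per_1k_queries(model, input_tokens=1500, output_tokens=500):
    """Dollar cost of 1,000 queries at the given per-query token counts."""
    in_price, out_price = PRICES[model]
    per_query = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_query * 1000


def blended_cost_per_1k(mix, **token_counts):
    """Traffic-weighted cost per 1,000 queries; `mix` maps model name to traffic share."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_1k_queries(model, **token_counts)
               for model, share in mix.items())


if __name__ == "__main__":
    monthly_queries = 10_000_000
    scenarios = {
        "all-sonnet": {"sonnet": 1.0},
        "all-opus": {"opus": 1.0},
        "routed": {"haiku": 0.70, "sonnet": 0.20, "opus": 0.10},
    }
    for label, mix in scenarios.items():
        per_1k = blended_cost_per_1k(mix)
        monthly = per_1k * monthly_queries / 1000
        print(f"{label}: ${per_1k:.2f} per 1K queries, ${monthly:,.0f}/month")
```

Changing the share sent to the most expensive tier by a few points moves the blended figure far more than any change at the cheap end, which is the practical reason to measure the mix rather than guess it.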
The empirical question is whether this hurts quality. RouteLLM’s published result is the cleanest data we have: their best router achieves 95% of GPT-4’s quality on MT Bench at 85% lower cost. That number compresses a lot of nuance, but the direction is unambiguous. For most production workloads, routing is the closest thing to free lunch on the menu.
The four routing patterns
Pattern 1 — Rule-based
The first cut. A regex match on the user’s input, a keyword list, a URL path. “If the query mentions a SKU number, route to the product-info agent. If it mentions ‘refund,’ route to the refund agent.” Zero inference latency, fully deterministic, easy to debug. The weakness is obvious: anything novel falls into a default bucket.
The production version is more thoughtful than it sounds. Most production agents have a router prompt — a system prompt for a small fast model that classifies the request into a category — but a lot of the routing logic that actually fires is rule-based code wrapping that classifier. Anthropic and OpenAI’s own product surfaces do this; the customer-facing “Claude” app and ChatGPT both look at the request and pick a model (Haiku, Sonnet, Opus / 4o-mini, 4o, o1) in part by rule.
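A minimal sketch of the rule-based first cut, assuming an ordered list of regex rules where the first match wins. The patterns, agent names, and the SKU format are illustrative guesses, not taken from any real deployment.

```python
import re
from typing import Optional

# Ordered (pattern, route) rules; the first match wins.
# Patterns and route names are hypothetical examples.
RULES = [
    (re.compile(r"\brefunds?\b|\bmoney back\b", re.IGNORECASE), "refund-agent"),
    (re.compile(r"\bSKU[-\s]?\d{4,}\b", re.IGNORECASE), "product-info-agent"),
]

# Anything the rules do not recognise lands here; this is the "default bucket".
DEFAULT_ROUTE = "general-agent"


def match_rule(query: str) -> Optional[str]:
    """Return the route for the first matching rule, or None if nothing matches."""
    for pattern, route_name in RULES:
        if pattern.search(query):
            return route_name
    return None


def route(query: str) -> str:
    """Deterministic routing with no model call: rules first, default otherwise."""
    return match_rule(query) or DEFAULT_ROUTE
```

Keeping `match_rule` separate from `route` makes it straightforward to hand the unmatched queries to a classifier later instead of a fixed default.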
Pattern 2 — Model-based
A small classifier (sometimes an LLM, often not an LLM) takes the request as input and picks the model. This is what RouteLLM does: train a router on preference data from Chatbot Arena, predict the win-rate of the strong model vs the weak model on this query, route based on a configurable threshold.
The RouteLLM published numbers are worth quoting in detail:
- 85% cost reduction on MT Bench at 95% of GPT-4’s quality
- 45% cost reduction on MMLU at the same quality bar
- 35% cost reduction on GSM8K (math is hard; less routable)
These numbers vary a lot by workload. MT Bench has lots of “easy” queries that don’t need a frontier model; MMLU has many factual queries where small models do fine; GSM8K is multi-step reasoning where the weak model fails more often, so the router has to send more to the strong model.
Model-based routing is the bulk of what 2026 vendors call “AI routing.” Martian, Not Diamond, and Unify AI all sit in this category — they ship a trained router (sometimes per-customer) that classifies the query and picks the model.
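The threshold mechanic is simple enough to sketch. The code below assumes you already have some function that predicts, per query, the probability that the strong model's answer would be preferred; that predictor is the hard part and is left as a parameter here. The function names and the `toy_win_rate` stand-in are hypothetical and are not RouteLLM's API.

```python
from typing import Callable, Sequence


def make_threshold_router(win_rate_predictor: Callable[[str], float],
                          threshold: float,
                          strong: str = "strong-model",
                          weak: str = "weak-model") -> Callable[[str], str]:
    """Route to `strong` when the predicted strong-model win rate meets the threshold."""
    def route(query: str) -> str:
        return strong if win_rate_predictor(query) >= threshold else weak
    return route


def calibrate_threshold(scores: Sequence[float], strong_share: float) -> float:
    """Pick a threshold so roughly `strong_share` of the sampled queries go to the strong model.

    `scores` are predictor outputs on a representative sample of traffic.
    """
    if not scores:
        raise ValueError("need at least one score to calibrate")
    if not 0.0 < strong_share <= 1.0:
        raise ValueError("strong_share must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    k = max(1, round(strong_share * len(ranked)))
    return ranked[k - 1]


def toy_win_rate(query: str) -> float:
    """Placeholder predictor (longer queries score higher). NOT a trained router."""
    return min(1.0, len(query.split()) / 50)


if __name__ == "__main__":
    sample = ["hi", "what is 2+2", "compare these two contracts clause by clause " * 5]
    threshold = calibrate_threshold([toy_win_rate(q) for q in sample], strong_share=0.34)
    router = make_threshold_router(toy_win_rate, threshold)
    for q in sample:
        print(router(q), "<-", q[:40])
```

The threshold is the cost/quality dial: raising it sends less traffic to the strong model, and calibrating it on a sample of your own traffic keeps the strong-model share predictable.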
Pattern 3 — Confidence-based escalation
A cascade. Send the query to a small model. If the small model’s output passes a confidence check, return it. If not, escalate to a larger model. Repeat.
The confidence signal varies:
- Logprob threshold. The small model’s output logprobs are low overall — escalate.
- Self-reported uncertainty. Ask the small model “are you confident in this answer? yes/no” and route on the answer. Lossy but easy.
- External verifier. A second cheap model checks the first model’s output; on a fail, escalate.
- Tool failure. The small model tries to call a tool, the tool errors or returns nonsense, escalate.
The classic production case is code-generation cascades. A 7B model writes the code; a unit test runs; on failure, the request escalates to a 70B model. Cursor’s apply model architecture includes this kind of cascade — a fast model attempts the edit, a verifier checks it, on failure a slower model retries.
Confidence escalation is the pattern that survives best under distribution shift: when users start asking new kinds of questions, the small model fails its confidence check and the query escalates, whereas a static classifier would keep mis-routing until someone retrains it. The downside is tail latency: a query that escalates twice pays the full latency of three model calls.
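A cascade can be sketched in a few lines. This version combines two of the signals listed above: a logprob threshold (here the geometric mean of token probabilities) and an optional external verifier. The `Completion` type, the `generate` callables, and the 0.8 default are assumptions for illustration; wire in whatever your client actually returns.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Completion:
    text: str
    token_logprobs: Sequence[float]  # assumed to be supplied by your model client


def mean_token_prob(completion: Completion) -> float:
    """Geometric mean of token probabilities; 0.0 when no logprobs are available."""
    if not completion.token_logprobs:
        return 0.0
    return math.exp(sum(completion.token_logprobs) / len(completion.token_logprobs))


# A tier is (name, generate), listed cheapest first.
Tier = Tuple[str, Callable[[str], Completion]]


def cascade(query: str,
            tiers: List[Tier],
            min_confidence: float = 0.8,
            verifier: Optional[Callable[[str, str], bool]] = None):
    """Try tiers cheapest-first; return (tier_name, completion, calls_made).

    A tier's answer is accepted if it clears the confidence threshold and the
    verifier (when given). The last tier's answer is always returned.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    for calls, (name, generate) in enumerate(tiers, start=1):
        completion = generate(query)
        is_last = calls == len(tiers)
        confident = mean_token_prob(completion) >= min_confidence
        verified = verifier is None or verifier(query, completion.text)
        if is_last or (confident and verified):
            return name, completion, calls
```

The returned `calls_made` is worth logging: its distribution is the tail-latency cost of the pattern, and a rising escalation rate is an early signal that traffic has shifted.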
Pattern 4 — Price-based / multi-provider
Orthogonal to patterns 1-3. Whatever model you decided to use, which provider’s instance of that model will you use? The same Llama-3.3-70B runs on Together AI ($0.88/M), Fireworks ($0.90/M), Groq ($0.59/M), and ten others, with different latency and uptime profiles.
OpenRouter is the canonical implementation: one API key, 200+ models, automatic fallbacks. Their pricing model is no-markup on token rates with a 5.5% platform fee on credit purchases. Portkey sits in the same category but with more enterprise features — conditional routing on request metadata, semantic caching, guardrails, observability. LiteLLM is the self-hosted OpenAI-compatible proxy for the same purpose.
Most production teams use one of these as the bottom layer of their routing stack: they make the model decision via patterns 1-3, then call the multi-provider gateway, which handles fallbacks, retries, and provider-level price arbitrage.
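What the gateway layer does can be sketched as "sort by price, try in order, fall back on failure". The code below is a simplified stand-in, not how OpenRouter, Portkey, or LiteLLM are implemented; the `Provider` type and `call` callables are hypothetical, and a real gateway would also weigh latency, rate limits, and backoff.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Provider:
    name: str
    usd_per_million_tokens: float
    call: Callable[[str], str]  # hypothetical client call; raises on failure


class AllProvidersFailed(RuntimeError):
    """Raised when every provider has exhausted its attempts."""


def call_cheapest_available(prompt: str,
                            providers: List[Provider],
                            max_attempts_per_provider: int = 2) -> Tuple[str, str]:
    """Try providers from cheapest to most expensive; return (provider_name, response)."""
    errors = []
    for provider in sorted(providers, key=lambda p: p.usd_per_million_tokens):
        for attempt in range(1, max_attempts_per_provider + 1):
            try:
                return provider.name, provider.call(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                errors.append(f"{provider.name} attempt {attempt}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Because this layer only chooses where a given model runs, it composes with any of patterns 1-3 without changing them.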
A real production case study
A customer support agent for a SaaS company — names redacted because the public version of this story is in pieces across multiple blog posts and the numbers are rounded — handling roughly 50,000 conversations per day.
The legacy stack: all queries to Sonnet, ~$15 per 1K conversations, ~$22,500/month inference spend. Quality was high. Spend was visible enough on the engineering AWS bill that someone asked the question.
The new stack:
- Rule-based first cut. “If the user is asking about pricing, route to a specialised pricing agent.” “If the user is asking about an order ID format, route to an order-lookup agent.” Both of these run on Haiku with a tight system prompt. About 35% of traffic.
- Model-based router for the rest. A small classifier (a fine-tuned Llama-3.1-8B doing 3-way classification: simple / medium / complex) sends each query to Haiku, Sonnet, or Opus respectively. Trained on ~10K labelled production queries with the human-rated “what model should have answered this” label.
- Final mix. 70% Haiku, 20% Sonnet, 10% Opus.
- Quality. A/B test against the legacy all-Sonnet stack on a held-out 5K conversation eval set, scored by human reviewers. Quality delta: -1.3% on overall satisfaction, statistically indistinguishable from noise.
- Cost. Weighted cost per 1K conversations dropped from $15 to ~$3, an 80% reduction.
The detail that mattered most: the classifier was trained on the team’s own production data, not a generic preference set. RouteLLM’s published numbers come from Chatbot Arena, which is a fine proxy but not your traffic. A router trained on your data, with the categories your support team actually cares about, beats a generic router by a meaningful margin every time. The Anyscale LLM router tutorial walks through how to build one.
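The two-stage shape of that stack (rules first, classifier for the remainder) might look roughly like the sketch below. The regexes, agent names, and the fail-safe to the largest tier on an unknown label are assumptions; the classifier is passed in as a plain function standing in for the fine-tuned model.

```python
import re
from typing import Callable, Tuple

# Classifier label -> model tier, following the simple / medium / complex split above.
TIER_FOR_LABEL = {"simple": "haiku", "medium": "sonnet", "complex": "opus"}

# Stage 1: rule-based first cut. Patterns and agent names are hypothetical.
RULE_ROUTES = [
    (re.compile(r"\b(pricing|price|plan cost)\b", re.IGNORECASE), ("pricing-agent", "haiku")),
    (re.compile(r"\border\s*#?\d{6,}\b", re.IGNORECASE), ("order-lookup-agent", "haiku")),
]


def route(query: str, classify: Callable[[str], str]) -> Tuple[str, str]:
    """Return (agent, model_tier). Rules fire first; the classifier handles the rest."""
    for pattern, target in RULE_ROUTES:
        if pattern.search(query):
            return target
    label = classify(query)
    # Assumption: an unrecognised label falls back to the largest tier,
    # trading cost for safety rather than risking a weak answer.
    return "general-agent", TIER_FOR_LABEL.get(label, "opus")
```

Running the rules before the classifier means the cheapest, most predictable traffic never pays for a classification call at all.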
The mistake teams keep making
Three patterns of failure I’ve watched happen, in roughly increasing order of avoidability:
- Routing without measurement. A team sets up Haiku-for-easy, Opus-for-hard, deploys, and assumes it’s working. Three months later they discover the router has been mis-classifying 30% of “easy” queries that actually needed Opus, and that this is the cause of a crater in customer satisfaction nobody could explain. The fix: a continuously-running A/B-style eval that compares router decisions to a “what should the answer have been” ground-truth on a sampled fraction of traffic.
- Routing too late. The router runs after the embedding for RAG, after the system prompt is constructed, after the request is enriched with user context. By the time the router fires, you’ve already paid the latency. The fix: route as early as possible — ideally on the raw user input — and let the chosen model own context-building.
- Confusing routing with prompting. “Our Haiku-with-better-prompts beats GPT-4.” This is sometimes true, often not, and almost always overfits to a small eval set. Prompting is a per-model optimisation that lives inside the routing decision, not a substitute for it. The right framing is “routing picks the model, prompting tunes the model” — they compose.
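For the first failure, the measurement loop can start very small. The sketch below samples a fraction of traffic for review and then summarises how often the router under-routed (sent a query to a weaker tier than it needed) or over-routed. The tier names and the shape of the audited records are assumptions; how you obtain the "needed tier" label (human review, a stronger judge model) is up to you.

```python
import random
from collections import Counter

# Tiers ordered from cheapest to most capable.
TIER_RANK = {"haiku": 0, "sonnet": 1, "opus": 2}


def sample_for_review(records, rate, seed=None):
    """Keep each record with probability `rate`, for later ground-truth labelling."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]


def routing_report(audited):
    """Summarise reviewed traffic.

    `audited` is an iterable of (routed_tier, needed_tier) pairs, where
    needed_tier comes from review. Returns fractions that sum to 1.
    """
    counts = Counter()
    for routed, needed in audited:
        if TIER_RANK[routed] < TIER_RANK[needed]:
            counts["under_routed"] += 1   # quality risk
        elif TIER_RANK[routed] > TIER_RANK[needed]:
            counts["over_routed"] += 1    # wasted spend
        else:
            counts["correct"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no audited records")
    return {key: counts[key] / total
            for key in ("correct", "under_routed", "over_routed")}
```

Tracking the two error types separately matters because they cost different things: under-routing shows up in satisfaction scores, over-routing shows up on the bill.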
The vendor consolidation that is happening
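A working snapshot of the routing-vendor landscape, end of Q2 2026, using the vendors named above: gateways (OpenRouter, Portkey, LiteLLM) on one side, model-routers-as-a-service (Martian, Not Diamond, Unify AI) on the other.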
The interesting consolidation: the gateway category is winning the mindshare, the model-router-as-a-service category is struggling. The reason is that gateways are additive — you keep your existing model choices and add reliability/fallback — while routers are substitutive — you give up your model choice to a trained classifier. Most teams prefer to own the classifier (it’s cheaper to train than to license), so they end up running a homegrown router on top of a managed gateway.
Building a router on your data — the actual steps
Since this comes up in every conversation: what does “train a router on your own production data” look like in practice? The version that actually ships:
- Sample 5-10K production queries from your logs. Don’t filter; you want representative distribution.
- Hand-label or assisted-label. For each query, decide which model should have answered it. Three categories is plenty: small / medium / large. Have a senior engineer or domain expert do the labelling.
- Train a classifier. Often a fine-tuned Llama-3.1-8B (3-way classification head) or even a logistic regression on embeddings is enough. The Anyscale tutorial fine-tunes a Llama-3-8B model as its router; gradient-boosted trees on embeddings are another lightweight option.
- Build the eval set. Hold out 1K labelled queries, never used in training. Score the trained router’s accuracy versus a baseline (e.g., always-largest).
- Deploy with a kill switch. The first version of the router runs in shadow mode — it predicts but doesn’t dispatch. Compare its predictions to your current routing for a week.
- Monitor drift. User questions change over time. Re-train every quarter or when accuracy on a sampled current-week subset drops below threshold.
The whole pipeline is a few weeks of engineering for one engineer. The payback is the 60-80% cost reduction. There’s no other line item in your inference budget with this return on effort.
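The evaluation, shadow-mode, and drift steps above need very little code once a classifier exists. The sketch below treats the router as any function from query to label and leaves training out entirely; the function names and the accuracy threshold are placeholders to adapt.

```python
def accuracy(predict, labelled):
    """Fraction of (query, label) pairs the router gets right."""
    if not labelled:
        raise ValueError("labelled set is empty")
    hits = sum(1 for query, label in labelled if predict(query) == label)
    return hits / len(labelled)


def always_largest(query):
    """Baseline router for comparison: send everything to the largest tier."""
    return "large"


def shadow_compare(shadow_predict, live_decisions):
    """Shadow mode: the new router predicts but does not dispatch.

    `live_decisions` is a list of (query, tier_actually_used). Returns the
    agreement rate and the disagreements, which are the cases worth reading.
    """
    if not live_decisions:
        raise ValueError("no live decisions to compare against")
    disagreements = []
    for query, live_tier in live_decisions:
        shadow_tier = shadow_predict(query)
        if shadow_tier != live_tier:
            disagreements.append((query, live_tier, shadow_tier))
    agreement = 1 - len(disagreements) / len(live_decisions)
    return agreement, disagreements


def needs_retraining(predict, recent_labelled, min_accuracy):
    """Drift check: True when accuracy on a freshly labelled sample falls below the bar."""
    return accuracy(predict, recent_labelled) < min_accuracy
```

In practice the disagreement list from `shadow_compare` tends to be more useful than the headline agreement rate, because it shows exactly which kinds of queries the new router would have handled differently.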
What’s coming
Three things worth watching:
- Inference-platform routing. Fireworks, Together AI, and the hyperscalers are starting to ship “automatic model selection within our platform” features. The same API call gets the right model under the hood. This is convenient and probably correct for most users, but it removes your ability to tune.
- Per-customer routing. SaaS platforms with millions of users are starting to learn per-user routing patterns. User A talks to Haiku because she always asks simple questions. User B always escalates to Opus. The router is personalised.
- Cache-aware routing. As discussed in our KV cache management post, the next layer is routing not just by model but by which replica has the prefix already cached. This is the meta-layer that compounds with everything else.
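To make the cache-aware idea concrete, here is a toy sketch: hash the prompt's tokens in fixed-size blocks, chain the hashes so each one identifies the whole prefix up to that point, and prefer the replica that already holds the longest run of those prefixes, breaking ties by load. The block size, the replica bookkeeping, and the function names are all assumptions; production serving stacks track this state very differently.

```python
import hashlib

BLOCK = 16  # tokens per cache block; an assumed value for illustration


def prefix_block_hashes(tokens, block=BLOCK):
    """Return one chained hash per full block, so equal prefixes yield equal hashes."""
    hashes = []
    running = hashlib.sha256()
    for end in range(block, len(tokens) + 1, block):
        running.update(repr(list(tokens[end - block:end])).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


def pick_replica(tokens, replicas):
    """Choose the replica with the longest cached prefix; lower load breaks ties.

    `replicas` maps name -> {"cached": set of prefix hashes, "load": int}.
    """
    hashes = prefix_block_hashes(tokens)

    def cached_blocks(state):
        count = 0
        for block_hash in hashes:
            if block_hash not in state["cached"]:
                break
            count += 1
        return count

    return max(replicas,
               key=lambda name: (cached_blocks(replicas[name]), -replicas[name]["load"]))
```

A long shared system prompt is the typical case where this pays off: every request that starts with it can land on a replica that has already processed those blocks.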
Takeaway
Three lines, since people skim:
- Routing is the highest-leverage cost lever in inference. 80% bill reductions are routine and don’t need quality compromises.
- Compose patterns. Rule-based first cut for the obvious cases, model-based for the rest, multi-provider gateway underneath for fallback. Don’t bet on one pattern.
- Train the router on your own production data, not on a public benchmark. The generic router is the starting point; the personal one is the finish line.
The most quietly important shift of the last two years isn’t the new frontier model. It’s the realisation that the answer to “how do we serve LLMs cheaply at scale” was never “negotiate the per-token rate” — it was “don’t call the expensive model unless you have to.” The teams that have internalised this are running 80% cheaper than the teams that haven’t. That’s the entire game.
Further reading: the RouteLLM repo and paper are the canonical research starting point. OpenRouter docs and Portkey docs cover the gateway category; Anyscale’s LLM router tutorial walks through building your own. The Not Diamond blog has working numbers on per-customer routing wins.