What is LLM model routing and how does an LLM cascade work?

Model routing sends each query to the most appropriate model based on difficulty, cost, or capability, instead of always using the largest model. A cascade is a sequential form: try the cheapest or smallest model first and only escalate to a larger model if the answer fails a quality or confidence check, reducing average cost while preserving quality on hard queries.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage techniques are model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Combined, these can cut spend by 60–90% with little or no quality regression, depending on the traffic mix.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

Model routing & cascades — Generative AI

The default architecture — send every request to your best, most expensive model — is also the most wasteful one. Real traffic is mostly easy: greetings, simple lookups, short classifications. Spending frontier-model money on “what are your hours?” is like taking a taxi to your mailbox. Model routing fixes that by matching each query to the cheapest model that can handle it — and it’s one of the highest-leverage cost moves you can make.

Try: Model routing · cost vs quality

Don't send every query to the expensive model

24 queries of varying difficulty. The router sends easy ones to a cheap model and hard ones to the frontier model. Slide the complexity threshold and watch cost and quality trade off. Most traffic is easy — so routing only the hard fraction up saves a lot.

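[Interactive chart: each bar is one query and bar height = query difficulty; each bar is served by either the cheap model or the frontier model. Example state at complexity threshold 0.60: cost saved 59%, total cost 19.5¢, average quality 86%, 38% of queries routed to the frontier model.]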

This is the sweet spot. Routing only the hardest 38% to the frontier model keeps average quality at 86% while cutting cost 59%. Cascades and semantic caching push this even further.

The idea: right-size every query

Estimate each query’s difficulty, then dispatch:

Easy queries → a small, cheap, fast model.
Hard queries → the big frontier (or reasoning) model.

The router itself is usually a tiny classifier or a cheap LLM that scores complexity. The single fact that makes routing pay off is that difficulty is skewed: most real traffic is easy, and only a thin slice is genuinely hard.

Now slide a routing threshold across that distribution. Send everything to the frontier model (threshold at 0) and you get top quality at top cost — most of it wasted on the easy bars. Send everything to the cheap model (threshold at 1) and cost collapses, but quality sags on the hard ones. The sweet spot routes only the small hard fraction up: most of the cost saving, almost none of the quality loss. The worked numbers below make that concrete.
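The dispatch step can be sketched in a few lines. This is a minimal illustration rather than a production router: `score_complexity` is a hypothetical length-and-keyword heuristic standing in for the tiny classifier or cheap LLM, and the model names, the hint words, and the 0.6 threshold are placeholders chosen for the example.

```python
# Hypothetical model names; substitute whatever your stack actually serves.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Assumed signal words that hint at multi-step reasoning (illustrative only).
HARD_HINTS = ("why", "compare", "prove", "plan", "analyze", "trade-off")


def score_complexity(query: str) -> float:
    """Return a rough difficulty score in [0, 1].

    Stand-in for a trained classifier: longer queries and queries that
    contain reasoning-style keywords score higher.
    """
    words = query.lower().split()
    length_score = min(len(words) / 40, 1.0)   # saturates at 40 words
    hits = sum(w.strip("?,.") in HARD_HINTS for w in words)
    hint_score = min(hits / 2, 1.0)            # saturates at 2 hint words
    return 0.5 * length_score + 0.5 * hint_score


def route(query: str, threshold: float = 0.6) -> str:
    """Scores at or above the threshold go to the frontier model."""
    if score_complexity(query) >= threshold:
        return FRONTIER_MODEL
    return CHEAP_MODEL


easy = "What are your hours?"
hard = ("Compare these two loan offers and analyze why one carries "
        "more risk, then plan repayment.")
for q in (easy, hard):
    print(f"{score_complexity(q):.2f} -> {route(q)}")
```

In practice the scorer is the part worth investing in: a misrouted hard query costs quality, while a misrouted easy query only costs money.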

Cascades: cheap-first, escalate on failure

A close cousin is the cascade: always try the cheap model first, and only escalate to the expensive one when the cheap answer fails a confidence or verification check. Because most queries pass at the cheap tier, you pay frontier prices only for the residual.
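The control flow of a cascade is short enough to sketch directly. Both model functions below are stubs invented for illustration, and the idea that a model call returns a confidence value alongside its answer is an assumption; a real system might derive that signal from a verifier, a log-probability, or a schema check.

```python
from typing import Callable, Tuple

# A model call returns (answer, confidence in [0, 1]). Both are stubs here.
ModelFn = Callable[[str], Tuple[str, float]]


def cheap_model(query: str) -> Tuple[str, float]:
    # Stub: pretend the cheap model is confident only on short queries.
    confidence = 0.9 if len(query.split()) <= 6 else 0.4
    return f"cheap answer to: {query}", confidence


def frontier_model(query: str) -> Tuple[str, float]:
    # Stub: the expensive model is assumed to be reliably confident.
    return f"frontier answer to: {query}", 0.95


def cascade(query: str,
            cheap: ModelFn = cheap_model,
            expensive: ModelFn = frontier_model,
            min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its confidence is too low.

    Returns (answer, tier), where tier records which model produced the answer.
    """
    answer, confidence = cheap(query)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = expensive(query)      # this path pays for both calls
    return answer, "frontier"


print(cascade("What are your hours?")[1])
print(cascade("Explain step by step how to restructure this multi-year loan portfolio")[1])
```

Note that an escalated query pays for both calls, so a cascade's average cost per query is the cheap cost plus the escalation rate times the expensive cost.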

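Figure: a cascade runs the cheap model first and escalates only the queries that fail the confidence check.

The worked example that follows returns to the routing threshold from the previous section and simulates it on 1000 queries.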

# 1000 queries as a fixed difficulty histogram, skewed toward easy.
# (complexity, count) — same distribution as the bar chart above.
buckets = [(0.1, 400), (0.3, 250), (0.5, 200), (0.7, 100), (0.9, 50)]
N = sum(c for _, c in buckets)              # 1000
COST_CHEAP, COST_EXP = 0.1, 2.0             # cents per query

def cheap_quality(x):
    # the cheap model's quality degrades on harder queries
    return min(1.0, max(0.45, 0.97 - 0.55 * x))

def run(threshold):
    cost = q_sum = to_exp = 0
    for x, count in buckets:
        if x >= threshold:                  # route up to the frontier model
            cost += count * COST_EXP
            q_sum += count * 0.95
            to_exp += count
        else:                               # keep it on the cheap model
            cost += count * COST_CHEAP
            q_sum += count * cheap_quality(x)
    return cost, q_sum / N, to_exp / N

all_exp_cost = N * COST_EXP                  # everything to the frontier model
for t in [0.0, 0.4, 0.6, 0.8, 1.0]:
    cost, q, frac = run(t)
    saved = (1 - cost / all_exp_cost) * 100
    print(f"threshold {t:.1f}: cost {cost:.0f}c, quality {q*100:.0f}%, "
          f"{frac*100:.0f}% routed up, saved {saved:.0f}%")

threshold 0.0: cost 2000c, quality 95%, 100% routed up, saved 0%
threshold 0.4: cost 765c, quality 90%, 35% routed up, saved 62%
threshold 0.6: cost 385c, quality 85%, 15% routed up, saved 81%
threshold 0.8: cost 195c, quality 81%, 5% routed up, saved 90%
threshold 1.0: cost 100c, quality 79%, 0% routed up, saved 95%

Read the middle rows. At a 0.6 threshold you route just 15% of queries to the frontier model, yet keep quality at 85% while cutting cost 81% — from 2000c down to 385c. Push the threshold to 1.0 (cheap model for everything) and you save 95% but quality slumps to 79% as hard queries get under-served. The knee of that curve — big saving, small quality hit — is exactly the sweet spot the skewed distribution hands you.
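One way to locate that knee programmatically is to fix a quality floor and take the cheapest threshold that still meets it. The sketch below reuses the toy histogram and prices from the worked example; the `pick_threshold` helper and the floor values are assumptions added for illustration.

```python
# Same toy distribution and prices as the worked example above.
buckets = [(0.1, 400), (0.3, 250), (0.5, 200), (0.7, 100), (0.9, 50)]
N = sum(c for _, c in buckets)
COST_CHEAP, COST_EXP = 0.1, 2.0              # cents per query
FRONTIER_QUALITY = 0.95


def cheap_quality(x):
    # the cheap model's quality degrades on harder queries
    return min(1.0, max(0.45, 0.97 - 0.55 * x))


def run(threshold):
    """Return (total cost in cents, average quality) for one threshold."""
    cost = q_sum = 0.0
    for x, count in buckets:
        if x >= threshold:                   # route up to the frontier model
            cost += count * COST_EXP
            q_sum += count * FRONTIER_QUALITY
        else:                                # keep it on the cheap model
            cost += count * COST_CHEAP
            q_sum += count * cheap_quality(x)
    return cost, q_sum / N


def pick_threshold(quality_floor, candidates=(0.0, 0.4, 0.6, 0.8, 1.0)):
    """Cheapest candidate threshold whose average quality meets the floor."""
    feasible = [(run(t)[0], t) for t in candidates
                if run(t)[1] >= quality_floor]
    if not feasible:
        return None                          # no threshold is good enough
    return min(feasible)[1]


for floor in (0.89, 0.84):
    print(floor, pick_threshold(floor))
```

With a floor of 0.89 the helper returns 0.4, and with a floor of 0.84 it returns 0.6, matching the rows of the table above.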

SLMs as the first layer

Routing isn’t only “cheap LLM vs frontier LLM.” A common production shape puts a small language model (SLM) — a 1–8B model, often self-hosted — as the very first layer. SLMs are fast and cheap enough to handle the bulk of traffic: intent classification, routing decisions, summarisation, extraction. Anything that needs deep reasoning (risk analysis, multi-step planning) escalates to the large model. In a finance pipeline, the SLM parses and summarises every document; only the genuinely complex cases reach the frontier model. You pay big-model prices for the long tail, not the firehose.

LiteLLM: one gateway for many models

The moment you route across several providers — a local SLM via Ollama, plus hosted APIs — you hit a wall of incompatible SDKs and auth schemes. A gateway like LiteLLM solves this: it exposes one OpenAI-compatible interface in front of 100+ providers, so your app calls a single endpoint and the gateway translates. Crucially, the routing and cascade logic lives there, not scattered through your code — and so do the cross-cutting concerns: per-key rate limits, spend tracking, retries, and fallbacks (if the primary model errors or times out, transparently retry on another). It turns “route between models” from app plumbing into one configurable layer.
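The fallback behaviour can be illustrated without any provider SDK. The sketch below is plain Python and deliberately not LiteLLM's API: `Gateway`, `ProviderError`, the two provider functions, and the per-call prices are hypothetical stand-ins that only show the shape of the idea (one entry point, ordered fallbacks, central spend tracking).

```python
from typing import Callable, Dict, List


class ProviderError(Exception):
    """Raised by a (stubbed) provider call that errors or times out."""


class Gateway:
    """Single entry point that hides which provider actually answered."""

    def __init__(self, providers: Dict[str, Callable[[str], str]],
                 price_cents: Dict[str, float]):
        self.providers = providers
        self.price_cents = price_cents
        self.spend_cents = 0.0               # central spend tracking

    def complete(self, prompt: str, chain: List[str]) -> str:
        """Try each model in `chain` in order; fall back on failure."""
        last_error = None
        for name in chain:
            try:
                reply = self.providers[name](prompt)
            except ProviderError as err:
                last_error = err             # remember it, try the next model
                continue
            self.spend_cents += self.price_cents[name]
            return reply
        raise RuntimeError("all providers failed") from last_error


def flaky_local_slm(prompt: str) -> str:
    raise ProviderError("local model timed out")   # simulate an outage


def hosted_model(prompt: str) -> str:
    return f"hosted reply to: {prompt}"


gateway = Gateway(
    providers={"local-slm": flaky_local_slm, "hosted": hosted_model},
    price_cents={"local-slm": 0.1, "hosted": 2.0},
)
reply = gateway.complete("What are your hours?", chain=["local-slm", "hosted"])
print(reply, gateway.spend_cents)
```

The application code only ever calls `gateway.complete`; which model answered, and what it cost, stays inside the gateway layer.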

In one breath

Sending every query to your best model is the most wasteful default — real traffic is mostly easy.
Routing scores each query’s difficulty up front and dispatches: cheap model for the easy majority, frontier model only for the hard slice.
A cascade instead tries cheap-first and escalates only when the answer fails a confidence check, so you pay frontier prices only for the residual.
Because difficulty is skewed toward easy, routing only the small hard fraction up can cut cost sharply: 62–81% in the worked example, with average quality at 85–90% against 95% for the all-frontier baseline.
In production, use a small language model as the first layer and a gateway such as LiteLLM to hold the routing, fallbacks, and spend tracking in one place.

Quick check


Q1. What is model routing?

Q2. How does a cascade differ from a router?

Q3. Why does routing save so much without much quality loss?

Routing is the headline cost lever, and it stacks with caching and with the serving-side gains covered in KV cache & continuous batching. It is also the mechanism for deciding which queries need a reasoning model.
