datarekha

Model routing & cascades

Most queries are easy. Routing only the hard ones to an expensive model, and cascading cheap-first, can cut LLM cost by 45–85% at near-equal quality. It is the pattern behind most cost-efficient LLM products.

7 min read Intermediate Generative AI Lesson 23 of 33

What you'll learn

  • Why sending every query to one big model wastes most of your budget
  • How complexity-based routing and cheap-first cascades work
  • The cost/quality tradeoff and where the sweet spot sits

Before you start

The default architecture — send every request to your best, most expensive model — is also the most wasteful one. Real traffic is mostly easy: greetings, simple lookups, short classifications. Spending frontier-model money on “what are your hours?” is like taking a taxi to your mailbox. Model routing fixes that by matching each query to the cheapest model that can handle it — and it’s one of the highest-leverage cost moves you can make.

The idea: right-size every query

Estimate each query’s difficulty, then dispatch:

  • Easy queries → a small, cheap, fast model.
  • Hard queries → the big frontier (or reasoning) model.

The router itself is usually a tiny classifier or a cheap LLM that scores complexity. Queries that score at or above a threshold go to the frontier model; the rest stay on the cheap one. In typical traffic, the hard queries are a small slice of the total.

Send everything to the frontier model (threshold at 0) and you get top quality at top cost — most of it wasted on easy queries. Send everything to the cheap model (threshold at 1) and cost collapses, but so does quality on the hard ones. The sweet spot routes only the hard fraction up: most of the cost saving, almost none of the quality loss.
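The threshold behavior described above can be sketched in a few lines. This is a minimal illustration, not a production router: the difficulty scores, the per-query prices (`CHEAP_COST`, `FRONTIER_COST`), and the function names are all hypothetical, and in practice the scores would come from a classifier or a cheap LLM.

```python
# Hypothetical per-query prices in arbitrary units, for illustration only.
CHEAP_COST = 1.0
FRONTIER_COST = 20.0


def route(difficulty: float, threshold: float) -> str:
    """Send a query to the frontier model when its score reaches the threshold."""
    return "frontier" if difficulty >= threshold else "cheap"


def total_cost(difficulties: list[float], threshold: float) -> float:
    """Sum the per-query cost over a batch of already-scored queries."""
    return sum(
        FRONTIER_COST if route(d, threshold) == "frontier" else CHEAP_COST
        for d in difficulties
    )


# Ten made-up difficulty scores, mostly easy: only two are at or above 0.7.
scores = [0.05, 0.1, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.8, 0.9]

all_frontier = total_cost(scores, threshold=0.0)  # every query goes to the frontier model
routed = total_cost(scores, threshold=0.7)        # only the two hard queries go up
print(all_frontier, routed)                       # 200.0 48.0
```

With these made-up numbers, routing only the two hard queries up costs 48 units instead of 200, a 76% saving. The real figure depends entirely on your traffic mix, your price gap, and how well the scorer separates easy from hard.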

Cascades: cheap-first, escalate on failure

A close cousin is the cascade: always try the cheap model first, and only escalate to the expensive one when the cheap answer fails a confidence or verification check. Because most queries pass at the cheap tier, you pay frontier prices only for the residual.

[Figure: query → cheap model (handles most) → confident? If yes, accept; if no, escalate to the frontier model.]
A cascade: cheap model first, escalate only the queries that fail the confidence check.
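A cascade along these lines might look like the sketch below. The `Model` type, the `min_confidence` cutoff, and the two stub models are assumptions made for illustration; a real system would replace the stubs with API calls and the confidence value with whatever check it trusts (a verifier, a log-probability score, a validation rule).

```python
from typing import Callable, Tuple

# Assumption: a model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, cheap: Model, frontier: Model,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its confidence is too low.

    Returns (answer, tier_used).
    """
    answer, confidence = cheap(query)
    if confidence >= min_confidence:
        return answer, "cheap"
    # The cheap answer failed the check, so pay for the frontier model.
    answer, _ = frontier(query)
    return answer, "frontier"


# Stub models standing in for real API calls (hypothetical behavior).
def cheap_stub(query: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short queries.
    confidence = 0.95 if len(query.split()) <= 5 else 0.4
    return f"cheap answer to: {query}", confidence


def frontier_stub(query: str) -> Tuple[str, float]:
    return f"frontier answer to: {query}", 0.99


print(cascade("what are your hours?", cheap_stub, frontier_stub)[1])  # cheap
print(cascade("compare these two contracts clause by clause and flag conflicts",
              cheap_stub, frontier_stub)[1])                          # frontier
```

Note that an escalated query pays for both tiers. If a fraction p of queries pass at the cheap tier, the expected cost per query is roughly cost_cheap + (1 − p) × cost_frontier, so the cascade pays off when p is high and the cheap model is much cheaper than the frontier one.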

Quick check

Q1. What is model routing?
Q2. How does a cascade differ from a router?
Q3. Why does routing save so much without much quality loss?

Next

Routing is the headline cost lever, and it stacks with caching and with the serving-side wins covered in KV cache & continuous batching. It is also the mechanism for deciding which queries need a reasoning model at all.

Practice this in an interview

What is LLM model routing and how does an LLM cascade work?

Model routing sends each query to the most appropriate model based on difficulty, cost, or capability, instead of always using the largest model. A cascade is a sequential form: try the cheapest or smallest model first and only escalate to a larger model if the answer fails a quality or confidence check, reducing average cost while preserving quality on hard queries.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.
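The token-based cost relationship can be made concrete with a small calculation. The prices below are hypothetical placeholders, not any provider's actual rates, and `request_cost` is an illustrative helper that assumes simple per-token pricing with separate input and output rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_million: float, price_out_per_million: float) -> float:
    """Dollar cost of one request under simple per-token pricing."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
baseline = request_cost(2_000, 500, 3.0, 15.0)
# Same prompt, but the output is capped at half the length.
shorter = request_cost(2_000, 250, 3.0, 15.0)
print(baseline, shorter)
```

When output tokens are priced higher than input tokens, as assumed here, trimming output length reduces cost more per token than trimming the prompt does, which is one reason output length control appears in the list above.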

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching and caching (including prompt caching for LLMs), and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.
