Cost & latency control
Agents make many model calls per task, so cost and latency multiply fast. Five levers can cut agent spend by 40–70%: routing, caching, token budgets, context pruning, and step limits.
What you'll learn
- Why agents are expensive — many LLM calls per task, multiplied by users
- The cost levers — routing, caching, budgets, context pruning, step limits
- How to set guardrails so an agent can't become a runaway bill
Before you start
A single LLM call has a known, bounded cost. An agent makes many calls per task — 3 to 10 is normal, more with retries and reflection — and you multiply that by every user. The economics that were fine in a demo can be brutal in production. The good news: cost is very controllable once you know the levers, and teams routinely cut agent spend 40–70% without hurting quality.
Why agents are expensive
- Many calls per task — every reasoning step, tool decision, and reflection is a model call.
- Growing context — each turn carries the accumulated history, so later calls in a run are bigger (and pricier) than earlier ones. This is the context engineering problem.
- Loops and retries — a poorly-bounded agent can repeat work, and a retry on a large context is expensive.
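To make the multiplication concrete, here is a rough back-of-the-envelope sketch. The prices, token counts, and the `run_cost` helper are all illustrative assumptions (not any vendor's real pricing); the point is only to show how re-sending a growing history makes a multi-step run cost far more than "steps times one call".

```python
# All prices and token counts are illustrative assumptions, not real vendor pricing.
PRICE_IN_PER_MTOK = 3.00    # assumed USD per million input tokens
PRICE_OUT_PER_MTOK = 15.00  # assumed USD per million output tokens


def run_cost(steps: int, base_context: int, tokens_added_per_step: int,
             output_per_step: int) -> float:
    """Estimate the USD cost of one agent run where every step re-sends the full history."""
    total = 0.0
    context = base_context
    for _ in range(steps):
        # Pay for the whole context on input, plus this step's output.
        total += context * PRICE_IN_PER_MTOK / 1_000_000
        total += output_per_step * PRICE_OUT_PER_MTOK / 1_000_000
        # The step's output and tool results join the history for the next call.
        context += tokens_added_per_step
    return total


single_call = run_cost(steps=1, base_context=2_000,
                       tokens_added_per_step=1_500, output_per_step=300)
agent_run = run_cost(steps=8, base_context=2_000,
                     tokens_added_per_step=1_500, output_per_step=300)

print(f"single call: ${single_call:.4f}")
print(f"8-step run:  ${agent_run:.4f} ({agent_run / single_call:.1f}x a single call)")
```

Under these assumed numbers the 8-step run costs roughly 20 times a single call rather than 8 times, because each later step pays again for everything that came before it. Multiply the per-run figure by your daily task volume to get a first estimate of production spend.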
The levers, in order of impact
- Model routing — don’t use a frontier model for every step. Use a small cheap model for routing, classification, and simple sub-tasks; reserve the expensive (or reasoning) model for the genuinely hard ones. This is the biggest lever — see model routing.
- Caching — prompt caching for the stable system/tool prefix (a big discount), and semantic caching to skip the model entirely for repeat questions.
- Context pruning / compaction — keep the window lean so later calls don’t balloon; processing bulk data through code execution keeps that data out of the context window entirely.
- Token & step budgets — cap the reasoning/output tokens per call and the number of steps per run. This bounds the worst case.
- Right-size the loop — a Plan-and-Execute agent makes fewer planning calls than a chatty ReAct loop for a known task; pick the cheapest loop the task needs.
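Two of the levers above, model routing and token/step budgets, can be sketched together in a few lines. This is a minimal illustration under stated assumptions: the model names, the `SIMPLE_KINDS` set, `RunBudget`, and `fake_model_call` are all hypothetical stand-ins, and a real agent would call its provider's API where the stub is.

```python
from dataclasses import dataclass

# Hypothetical model names; substitute whatever your provider offers.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "frontier-model"

# Assumed set of step kinds that a small model handles well enough.
SIMPLE_KINDS = {"route", "classify", "extract"}


def pick_model(step_kind: str) -> str:
    """Send simple step kinds to the cheap model, everything else to the strong one."""
    return CHEAP_MODEL if step_kind in SIMPLE_KINDS else STRONG_MODEL


class BudgetExceeded(RuntimeError):
    """Raised when a run tries to continue past its step or token cap."""


@dataclass
class RunBudget:
    max_steps: int
    max_total_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def start_step(self) -> None:
        """Call before each model call; refuses once either cap has been reached."""
        if self.steps_used >= self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} reached")
        if self.tokens_used >= self.max_total_tokens:
            raise BudgetExceeded(f"token limit of {self.max_total_tokens} reached")
        self.steps_used += 1

    def record_tokens(self, tokens: int) -> None:
        """Call after each model call with the tokens it consumed."""
        self.tokens_used += tokens


def fake_model_call(model: str, step_kind: str) -> int:
    """Stand-in for a real API call; returns tokens consumed (made-up numbers)."""
    return 200 if model == CHEAP_MODEL else 1_500


def run_agent(plan: list[str], budget: RunBudget) -> list[str]:
    """Execute each planned step under the budget and return the model used per step."""
    models_used = []
    for step_kind in plan:
        budget.start_step()
        model = pick_model(step_kind)
        budget.record_tokens(fake_model_call(model, step_kind))
        models_used.append(model)
    return models_used


budget = RunBudget(max_steps=5, max_total_tokens=4_000)
models = run_agent(["classify", "reason", "extract", "reason"], budget)
print(models, budget.steps_used, budget.tokens_used)
```

The budget check runs before each call, so a run that starts looping is stopped before it pays for another step instead of after. Because the token cap is checked at the start of a step, the last permitted call can still overshoot it slightly; a stricter version would also cap the per-call output tokens.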
Next
That completes the production toolkit — evaluation, observability, context engineering, and cost control — for shipping agents that are reliable and affordable.
Practice this in an interview
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.
Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.
Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
Agents use short-term memory (the working context window) and long-term memory stored in vector databases or files, often split into episodic, semantic, and procedural memory. Context engineering is the discipline of curating what goes into the limited context window, and compaction summarizes or prunes older history so the agent retains key information without overflowing the window or degrading from too much noise.