datarekha

Cost & latency control

Agents make many model calls per task, so cost and latency multiply fast. The levers that cut agent spend 40–70%: routing, caching, token budgets, context pruning, and step limits.

7 min read · Intermediate · Agentic AI · Lesson 38 of 42

What you'll learn

  • Why agents are expensive — many LLM calls per task, multiplied by users
  • The cost levers — routing, caching, budgets, context pruning, step limits
  • How to set guardrails so an agent can't become a runaway bill

Before you start

A single LLM call has a known, bounded cost. An agent makes many calls per task — 3 to 10 is normal, more with retries and reflection — and you multiply that by every user. The economics that were fine in a demo can be brutal in production. The good news: cost is very controllable once you know the levers, and teams routinely cut agent spend 40–70% without hurting quality.

Why agents are expensive

  • Many calls per task — every reasoning step, tool decision, and reflection is a model call.
  • Growing context — each turn carries the accumulated history, so later calls in a run are bigger (and pricier) than earlier ones — the context engineering problem.
  • Loops and retries — a poorly-bounded agent can repeat work, and a retry on a large context is expensive.
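To make the growing-context point concrete, here is a minimal arithmetic sketch. The prices, token counts, and the `run_cost` helper are all illustrative assumptions, not figures from any provider; the only thing it models is that each call resends the accumulated history.

```python
# Hypothetical prices in USD per million tokens; real prices vary by model and provider.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def run_cost(steps: int, base_prompt_tokens: int,
             tokens_added_per_step: int, output_tokens_per_step: int) -> float:
    """Estimate the cost of one agent run where every call resends the full history."""
    total = 0.0
    context = base_prompt_tokens
    for _ in range(steps):
        # Pay for the whole context as input, plus this step's output.
        total += context * INPUT_PRICE_PER_MTOK / 1_000_000
        total += output_tokens_per_step * OUTPUT_PRICE_PER_MTOK / 1_000_000
        # Model output and tool results join the history for the next call.
        context += tokens_added_per_step
    return total


single_call = run_cost(1, 2000, 1500, 300)
ten_step_run = run_cost(10, 2000, 1500, 300)
print(f"1 call: ${single_call:.4f}, 10-step run: ${ten_step_run:.4f}")
print(f"ratio: {ten_step_run / single_call:.1f}x")
```

With these made-up numbers the ten-step run costs roughly 29 times the single call rather than 10 times, because the later calls carry a much larger context.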

The levers, in order of impact

  • Model routing — don’t use a frontier model for every step. Use a small cheap model for routing, classification, and simple sub-tasks; reserve the expensive (or reasoning) model for the genuinely hard ones. This is the biggest lever — see model routing.
  • Caching — prompt caching for the stable system/tool prefix (a big discount), and semantic caching to skip the model entirely for repeat questions.
  • Context pruning / compaction — keep the window lean so later calls don’t balloon; processing bulk data through code execution keeps it out of the context window entirely.
  • Token & step budgets — cap the reasoning/output tokens per call and the number of steps per run. This bounds the worst case.
  • Right-size the loop — a Plan-and-Execute agent makes fewer planning calls than a chatty ReAct loop for a known task; pick the cheapest loop the task needs.
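As a rough illustration of the routing lever from the list above, the sketch below sends simple step kinds to a cheap tier and everything else to an expensive one. The tier names, prices, and the set of "simple" step kinds are assumptions made up for this example; a real router might use a classifier or confidence score instead of a fixed set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    input_price_per_mtok: float  # hypothetical USD per million input tokens


SMALL = ModelTier("small-model", 0.25)
FRONTIER = ModelTier("frontier-model", 3.00)

# Step kinds assumed to be easy enough for the small tier.
SIMPLE_STEP_KINDS = {"route", "classify", "extract"}


def pick_model(step_kind: str) -> ModelTier:
    """Route simple steps to the cheap tier, everything else to the frontier tier."""
    return SMALL if step_kind in SIMPLE_STEP_KINDS else FRONTIER


def input_cost(steps: list[tuple[str, int]], always_frontier: bool = False) -> float:
    """Sum input-token cost over (step_kind, input_tokens) pairs."""
    total = 0.0
    for kind, tokens in steps:
        tier = FRONTIER if always_frontier else pick_model(kind)
        total += tokens * tier.input_price_per_mtok / 1_000_000
    return total


run = [("route", 1000), ("classify", 1500), ("plan", 4000),
       ("extract", 3000), ("synthesize", 6000)]
routed = input_cost(run)
unrouted = input_cost(run, always_frontier=True)
print(f"routed: ${routed:.6f}, frontier-only: ${unrouted:.6f}")
```

The saving depends entirely on how much of the run qualifies as simple, so it is worth measuring the step mix before assuming a particular discount.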
Figure: cost levers stack. Starting from baseline cost, adding routing, caching, and pruning lowers the final cost; together they commonly cut agent spend 40–70%.
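The token and step budgets can be sketched as a small guard that the agent loop consults before every call. `RunBudget`, `BudgetExceeded`, and `fake_model_call` are hypothetical names for illustration; the fake call stands in for a real model client and simply returns a token count.

```python
class BudgetExceeded(Exception):
    """Raised when a run hits its step or token ceiling."""


class RunBudget:
    """Tracks steps and tokens for one agent run (hypothetical helper)."""

    def __init__(self, max_steps: int, max_total_tokens: int) -> None:
        self.max_steps = max_steps
        self.max_total_tokens = max_total_tokens
        self.steps = 0
        self.tokens = 0

    def start_step(self) -> None:
        # Check before the call so an exhausted budget never pays for another one.
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} reached")
        if self.tokens >= self.max_total_tokens:
            raise BudgetExceeded(f"token budget of {self.max_total_tokens} reached")
        self.steps += 1

    def record(self, tokens_used: int) -> None:
        self.tokens += tokens_used


def fake_model_call(step: int, max_output_tokens: int) -> int:
    """Stand-in for a real model call; returns the tokens it consumed."""
    input_tokens = 1000 + 500 * step             # context grows each step
    output_tokens = min(400, max_output_tokens)  # per-call output cap
    return input_tokens + output_tokens


def run_agent(budget: RunBudget, max_output_tokens: int = 200) -> tuple[int, str]:
    """Run a loop that never decides it is done; the budget is what stops it."""
    completed = 0
    try:
        for step in range(1000):
            budget.start_step()
            budget.record(fake_model_call(step, max_output_tokens))
            completed += 1
    except BudgetExceeded as exc:
        return completed, str(exc)
    return completed, "finished"


print(run_agent(RunBudget(max_steps=8, max_total_tokens=10_000)))
```

Because the check runs before each call, the worst-case overshoot is the token count of a single call, which is why the per-call output cap matters alongside the run-level budget.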

Quick check

Q1. Why do agents cost so much more than a single LLM call?
Q2. Which lever typically has the biggest impact on agent cost?
Q3. What's the right cost metric for an agent?

Next

That completes the production toolkit — evaluation, observability, context engineering, and cost control — for shipping agents that are reliable and affordable.

Practice this in an interview

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.
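A small worked example of how prompt caching changes the bill, under stated assumptions: the prices are invented, and `cached_discount` (cached prefix tokens billed at 10% of the normal input price) is a placeholder since actual cache pricing differs by provider. The `call_cost` function is hypothetical.

```python
def call_cost(prefix_tokens: int, fresh_tokens: int, output_tokens: int, *,
              prefix_cached: bool,
              input_price: float = 3.00,      # hypothetical USD per million tokens
              output_price: float = 15.00,    # hypothetical USD per million tokens
              cached_discount: float = 0.10   # assumed fraction billed for cached input
              ) -> float:
    """Cost of one call whose stable prefix may be served from the prompt cache."""
    prefix_rate = input_price * cached_discount if prefix_cached else input_price
    total = (prefix_tokens * prefix_rate
             + fresh_tokens * input_price
             + output_tokens * output_price)
    return total / 1_000_000


# An 8,000-token system/tool prefix, 500 fresh tokens, 300 output tokens.
cold = call_cost(8000, 500, 300, prefix_cached=False)
warm = call_cost(8000, 500, 300, prefix_cached=True)
print(f"cold: ${cold:.4f}, warm: ${warm:.4f}")
```

The example also shows why output length control is listed separately: once the prefix is cached, output tokens become the largest remaining term in this particular setup.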

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
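One way to picture tagged, unit-economics reporting is the sketch below. The usage records, tag names, and budget figures are made-up illustrative data, and `cost_per_prediction` and `over_budget` are hypothetical helpers; in practice the records would come from billing exports or request logs.

```python
from collections import defaultdict

# Made-up usage records; every workload carries team and model tags.
usage_log = [
    {"team": "search", "model": "small-model", "cost_usd": 0.40, "predictions": 2000},
    {"team": "search", "model": "frontier-model", "cost_usd": 3.60, "predictions": 1200},
    {"team": "support", "model": "frontier-model", "cost_usd": 6.00, "predictions": 1000},
]


def cost_per_prediction(records: list[dict], tag: str) -> dict[str, float]:
    """Group records by a tag and return cost per prediction for each group."""
    cost: dict[str, float] = defaultdict(float)
    count: dict[str, int] = defaultdict(int)
    for rec in records:
        cost[rec[tag]] += rec["cost_usd"]
        count[rec[tag]] += rec["predictions"]
    return {key: cost[key] / count[key] for key in cost if count[key] > 0}


def over_budget(records: list[dict], budgets_usd: dict[str, float]) -> list[str]:
    """Return the teams whose total spend exceeds their budget."""
    spend: dict[str, float] = defaultdict(float)
    for rec in records:
        spend[rec["team"]] += rec["cost_usd"]
    return sorted(team for team, total in spend.items()
                  if total > budgets_usd.get(team, float("inf")))


by_team = cost_per_prediction(usage_log, "team")
alerts = over_budget(usage_log, {"search": 5.00, "support": 5.00})
print(by_team, alerts)
```

Grouping by `"model"` instead of `"team"` uses the same function, which is the practical payoff of tagging every record consistently.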

What types of memory do agents use, and what is context engineering and compaction?

Agents use short-term memory (the working context window) and long-term memory stored in vector databases or files, often split into episodic, semantic, and procedural memory. Context engineering is the discipline of curating what goes into the limited context window, and compaction summarizes or prunes older history so the agent retains key information without overflowing the window or degrading from too much noise.
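Compaction can be sketched as: keep the most recent messages verbatim and replace everything older with a summary once the history exceeds a token limit. This is a simplified illustration, not a reference implementation; `approx_tokens` counts whitespace-separated words instead of real tokens, and `summarize` is a stand-in for what would normally be a call to a small model.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (word count); a real system would use the model's tokenizer."""
    return len(text.split())


def summarize(messages: list[str]) -> str:
    """Stand-in summarizer; in practice this would be a cheap model call."""
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list[str], max_tokens: int, keep_recent: int = 2) -> list[str]:
    """Summarize older messages when the history exceeds max_tokens."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(approx_tokens(message) for message in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history  # nothing to compact
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = [f"turn {i}: " + "word " * 8 for i in range(5)]
compacted = compact(history, max_tokens=30, keep_recent=2)
print(compacted)
```

Keeping the latest turns verbatim is a deliberate choice in this sketch: the most recent tool results are usually the ones the next step depends on, while older turns tolerate lossy summarization better.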
