What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales mainly with output tokens and model size, because output is generated one token at a time. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend by 60–90% on favorable workloads, provided quality is verified with evals rather than assumed.
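
To make the token arithmetic concrete, here is a minimal sketch of a per-call cost estimate with a prompt-cache discount. The prices, the 90% discount, and the names Pricing and call_cost are illustrative assumptions, not any vendor's real pricing or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real vendor prices)."""
    input_per_m: float
    output_per_m: float
    cached_input_discount: float  # fraction taken off cached input tokens


def call_cost(p: Pricing, input_tokens: int, output_tokens: int,
              cached_tokens: int = 0) -> float:
    """Estimate one call's cost; cached_tokens is the input served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cached_rate = p.input_per_m * (1.0 - p.cached_input_discount)
    total = (fresh_tokens * p.input_per_m
             + cached_tokens * cached_rate
             + output_tokens * p.output_per_m)
    return total / 1_000_000


# Assumed prices: $2 per million input tokens, $8 per million output tokens.
pricing = Pricing(input_per_m=2.0, output_per_m=8.0, cached_input_discount=0.9)
uncached = call_cost(pricing, input_tokens=10_000, output_tokens=500)
cached = call_cost(pricing, input_tokens=10_000, output_tokens=500,
                   cached_tokens=8_000)
```

Under these assumed numbers, caching 8,000 of the 10,000 input tokens takes the call from about $0.024 to about $0.0096, which is why a long, stable prefix is worth caching.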

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching and caching (including prompt caching for LLMs), and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.
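
Routing easy requests to a cheaper model can be sketched as below. This is a toy under stated assumptions: looks_simple is a made-up heuristic (a production router would more likely use a trained classifier or a small model), and small_model and large_model are stand-in functions, not real API clients.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]


def make_router(is_simple: Callable[[str], bool],
                small: Model, large: Model) -> Callable[[str], Tuple[str, str]]:
    """Build a router: simple prompts go to the small model, the rest to the large one."""
    def route(prompt: str) -> Tuple[str, str]:
        if is_simple(prompt):
            return "small", small(prompt)
        return "large", large(prompt)
    return route


# Hypothetical stand-ins so the sketch runs without any API access.
def looks_simple(prompt: str) -> bool:
    # Toy heuristic: short prompts that do not ask for an explanation.
    return len(prompt.split()) <= 12 and "why" not in prompt.lower()


def small_model(prompt: str) -> str:
    return f"[small] {prompt}"


def large_model(prompt: str) -> str:
    return f"[large] {prompt}"


route = make_router(looks_simple, small_model, large_model)
easy = route("Classify this ticket: refund request")
hard = route("Explain why the deployment failed and propose a fix")
```

Returning the chosen tier alongside the answer makes it easy to log which model served each request, so routing decisions can be checked against accuracy metrics.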

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
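
As a rough illustration of tag-based attribution, the sketch below rolls up billing records by team and model and computes cost per prediction. The record fields (team, model, env, cost_usd, predictions) and all figures are hypothetical; real billing exports will look different.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def cost_per_prediction(records: List[dict]) -> Dict[Tuple[str, str], Optional[float]]:
    """Aggregate tagged records into cost per prediction for each (team, model)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [total cost, total predictions]
    for r in records:
        key = (r["team"], r["model"])
        totals[key][0] += r["cost_usd"]
        totals[key][1] += r["predictions"]
    # None marks spend with no traffic, e.g. an idle endpoint worth investigating.
    return {k: (cost / n if n else None) for k, (cost, n) in totals.items()}


# Made-up records for illustration only.
records = [
    {"team": "search", "model": "ranker", "env": "prod",
     "cost_usd": 120.0, "predictions": 400_000},
    {"team": "search", "model": "ranker", "env": "prod",
     "cost_usd": 80.0, "predictions": 100_000},
    {"team": "support", "model": "chatbot", "env": "prod",
     "cost_usd": 50.0, "predictions": 0},
]
unit_costs = cost_per_prediction(records)
```

A workload that reports cost but zero predictions shows up as None here, which is one simple way to surface idle or overprovisioned endpoints.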

What types of memory do agents use, and what is context engineering and compaction?

Agents use short-term memory (the working context window) and long-term memory stored in vector databases or files, often split into episodic, semantic, and procedural memory. Context engineering is the discipline of curating what goes into the limited context window, and compaction summarizes or prunes older history so the agent retains key information without overflowing the window or degrading from too much noise.
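
Compaction can be sketched as keeping the most recent messages verbatim and replacing everything older with a summary. In the sketch below, compact and its parameters are illustrative names, and fake_summarize is a stand-in for what would normally be a model call.

```python
from typing import Callable, List


def compact(history: List[str], keep_recent: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace all but the last keep_recent messages with a single summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(history) <= keep_recent:
        return list(history)  # nothing old enough to compact
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


# Stand-in summarizer; a real one would call a model (assumption).
def fake_summarize(messages: List[str]) -> str:
    return f"[summary of {len(messages)} earlier messages]"


history = [f"turn {i}" for i in range(1, 7)]  # six messages
compacted = compact(history, keep_recent=2, summarize=fake_summarize)
```

The trade-off is that a summary loses detail, so the choice of keep_recent and the quality of the summarizer decide how much the agent still remembers.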

Cost & latency control — Agentic AI

A single LLM call has a known, bounded cost. An agent makes many calls per task — 3 to 10 is normal, more with retries and reflection — and you multiply that by every user. The economics that were fine in a demo can be brutal in production. The good news: cost is very controllable once you know the levers, and teams routinely cut agent spend 40–70% without hurting quality.

Why agents are expensive

Many calls per task — every reasoning step, tool decision, and reflection is a model call.
Growing context — each turn carries the accumulated history, so later calls in a run are bigger (and pricier) than earlier ones — the context engineering problem.
Loops and retries — a poorly-bounded agent can repeat work, and a retry on a large context is expensive.
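
The growing-context effect is easy to quantify. The sketch below assumes a simplified model: every step resends a fixed prefix plus everything added by earlier steps, and each step adds the same number of tokens. The function name and the token counts are illustrative.

```python
def total_input_tokens(prefix_tokens: int, tokens_added_per_step: int,
                       steps: int) -> int:
    """Sum input tokens over a run where each call resends all prior history."""
    total = 0
    context = prefix_tokens
    for _ in range(steps):
        total += context                  # this call pays for the whole context
        context += tokens_added_per_step  # history grows before the next call
    return total


# Assumed sizes: a 2,000-token prefix and 1,000 tokens added per step.
five_steps = total_input_tokens(2_000, 1_000, steps=5)
ten_steps = total_input_tokens(2_000, 1_000, steps=10)
```

Under these assumptions the total is steps * prefix + added * steps * (steps - 1) / 2, so doubling the run from 5 to 10 steps more than triples input tokens (20,000 versus 65,000).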

The levers, in order of impact

Model routing — don’t use a frontier model for every step. Use a small cheap model for routing, classification, and simple sub-tasks; reserve the expensive (or reasoning) model for the genuinely hard ones. This is the biggest lever — see model routing.
Caching — prompt caching for the stable system/tool prefix (a big discount), and semantic caching to skip the model entirely for repeat questions.
Context pruning / compaction — keep the window lean so later calls don’t balloon; code execution keeps bulk data out-of-context entirely.
Token & step budgets — cap the reasoning/output tokens per call and the number of steps per run. This bounds the worst case.
Right-size the loop — a Plan-and-Execute agent makes fewer planning calls than a chatty ReAct loop for a known task; pick the cheapest loop the task needs.
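
A step and token budget from the list above might look like the following sketch. RunBudget, run_agent, and the step callable are hypothetical names; the token cap here is soft, meaning the run stops before the next call once the cap is reached, so it can overshoot by at most one call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def can_continue(self) -> bool:
        """True while both the step cap and the token cap have headroom."""
        return self.steps_used < self.max_steps and self.tokens_used < self.max_tokens

    def record(self, tokens: int) -> None:
        """Account for one completed model call."""
        self.steps_used += 1
        self.tokens_used += tokens


def run_agent(step: Callable[[], Tuple[int, Optional[str]]],
              budget: RunBudget, fallback: str) -> str:
    """Call step() until it returns an answer or the budget runs out."""
    while budget.can_continue():
        tokens, answer = step()  # step returns (tokens spent, answer or None)
        budget.record(tokens)
        if answer is not None:
            return answer
    return fallback  # graceful fallback instead of an unbounded loop


# A stand-in step that never finishes, to show the cap taking effect.
def never_done() -> Tuple[int, Optional[str]]:
    return 500, None


budget = RunBudget(max_steps=5, max_tokens=2_000)
result = run_agent(never_done, budget, fallback="partial answer: budget reached")
```

In this run the token cap triggers first: after four 500-token calls the budget is spent and the fallback is returned, even though one step remained.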

Cost levers stack: routing, caching, and context pruning together cut agent spend substantially.

In one breath

A single LLM call is bounded, but an agent makes many calls per task (3–10+, more with retries/reflection) × every user — demo economics turn brutal in production.
Agents are costly from many calls, growing context each turn, and loops/retries on large contexts.
Levers, by impact: model routing (cheap model for easy steps — the biggest one), caching (prompt + semantic), context pruning/compaction, token & step budgets, right-sizing the loop — and they stack to cut spend 40–70%.
Bound the worst case: every agent needs a hard step budget and a per-run token/dollar ceiling with a graceful fallback — an uncapped reflection loop is a runaway-bill incident.
Optimize cost per completed task, not per call — a cheap model that fails and retries can cost more than the expensive one that succeeds first try; tie cost to eval outcomes.
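
The last point can be checked with simple arithmetic over logged runs. The sketch below uses made-up costs and success counts; cost_per_completed_task is an illustrative helper, and each run is assumed to be logged as a (cost in USD, succeeded) pair.

```python
from typing import Iterable, Tuple


def cost_per_completed_task(runs: Iterable[Tuple[float, bool]]) -> float:
    """Total spend divided by the number of successful runs."""
    runs = list(runs)
    total_cost = sum(cost for cost, _ in runs)
    completed = sum(1 for _, ok in runs if ok)
    return total_cost / completed if completed else float("inf")


# Made-up logs: the cheap model costs $0.01 per attempt but succeeds 2 of 10 times;
# the expensive model costs $0.04 per attempt and succeeds 9 of 10 times.
cheap_runs = [(0.01, i < 2) for i in range(10)]
expensive_runs = [(0.04, i < 9) for i in range(10)]

cheap = cost_per_completed_task(cheap_runs)
expensive = cost_per_completed_task(expensive_runs)
```

With these assumed numbers the cheap model costs about $0.05 per completed task against roughly $0.044 for the expensive one, so the per-call price alone points the wrong way.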

Quick check

Q1. Why do agents cost so much more than a single LLM call?

Q2. Which lever typically has the biggest impact on agent cost?

Q3. What's the right cost metric for an agent?

That completes the production toolkit — evaluation, observability, context engineering, and cost control — for shipping agents that are reliable and affordable.
