What techniques reduce LLM cost and latency in production?
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage techniques are model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend by 60–90% on suitable workloads, provided quality is checked against an evaluation set as each change is rolled out.
How to think about it
Token economics
Every API call costs input_tokens x input_price + output_tokens x output_price, and output tokens are usually priced several times higher than input tokens. Output tokens dominate latency because they are generated sequentially. Input tokens are processed in parallel, so a long system prompt adds little latency beyond time to first token, but you pay for it again on every call that resends it.
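The arithmetic above can be sketched as a small helper. The `Pricing` class, the `call_cost` function, and the prices are illustrative placeholders, not any provider's actual rates; input and output rates are kept separate because providers typically bill them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (placeholders, not real rates)."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one API call: each token type is billed at its own rate."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Made-up prices: $1 per million input tokens, $5 per million output tokens.
example_pricing = Pricing(input_per_mtok=1.0, output_per_mtok=5.0)

# A 4,000-token prompt that produces a 500-token answer.
cost = call_cost(4_000, 500, example_pricing)
print(f"${cost:.4f} per call")
```

With these placeholder rates the 500 output tokens account for a sizeable share of the bill despite being an eighth of the input length, which is why output length is worth controlling.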
Top techniques
1. Model routing / cascading
Classify queries by complexity, route simple ones to a smaller model (e.g., GPT-4o-mini or Claude Haiku), and escalate to a frontier model only when needed. A binary classifier on query embeddings can make this decision in under 5 ms once the embedding is available; computing the embedding itself adds its own latency.
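A minimal sketch of such a router, assuming the query embedding has already been computed elsewhere. The model identifiers, weights, bias, and three-dimensional "embeddings" are hypothetical toy values chosen for illustration; a real router would learn the weights from labeled traffic.

```python
import math
from typing import Sequence

# Hypothetical model identifiers, used only for illustration.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"


def complexity_probability(
    embedding: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic-regression score: estimated probability the query needs the frontier model."""
    if len(embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    z = sum(e * w for e, w in zip(embedding, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def route(
    embedding: Sequence[float],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,
) -> str:
    """Pick the small model unless the classifier score reaches the threshold."""
    p = complexity_probability(embedding, weights, bias)
    return FRONTIER_MODEL if p >= threshold else SMALL_MODEL


# Toy weights and embeddings, hand-picked so the two queries land on different sides.
toy_weights = [2.0, -1.0, 0.5]
toy_bias = -0.5
simple_query = [0.1, 0.9, 0.0]
hard_query = [0.9, 0.1, 0.4]

print(route(simple_query, toy_weights, toy_bias))
print(route(hard_query, toy_weights, toy_bias))
```

Raising `threshold` sends more traffic to the small model and lowers cost, at the risk of under-serving hard queries, so it is worth tuning against an evaluation set.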
2. Prompt caching
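Anthropic and OpenAI both support prefix caching. Keep the static portion of your prompt (system prompt, few-shot examples, and any retrieved docs that are reused across calls) at the front and the dynamic portion (user message) at the end, because a cache hit requires the prefix to match exactly. Pricing differs by provider: Anthropic bills cache reads at roughly 10% of the base input price but adds a surcharge on the call that writes the cache, and it requires explicit cache_control markers; OpenAI caches eligible prefixes automatically, with a discount on cached tokens that varies by model.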
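# Anthropic prompt caching: mark the expensive prefix
import anthropic

# Placeholders: substitute your own static prompt and the incoming query.
# Prefixes shorter than the model's minimum cacheable length are not cached.
very_long_system_prompt = "..."
user_query = "..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": very_long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
# Nonzero on a cache hit; the call that writes the cache reports
# cache_creation_input_tokens instead.
print(response.usage.cache_read_input_tokens)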
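To see when caching pays off, the prefix cost can be compared with and without a cache. This is a rough sketch: the function name is hypothetical, and the default multipliers (1.25x base input price for the call that writes the cache, 0.10x for each later read) are assumptions that should be checked against the provider's current pricing page. It also assumes every call after the first arrives before the cache entry expires.

```python
def prefix_cost_units(
    prefix_tokens: int,
    calls: int,
    write_multiplier: float = 1.25,  # assumed surcharge on the cache-writing call
    read_multiplier: float = 0.10,   # assumed price of a cached read
) -> tuple[float, float]:
    """Return (uncached, cached) prefix cost, in units of the base input-token price.

    Assumes the first call writes the cache and every later call hits it.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    uncached = float(prefix_tokens * calls)
    cached = prefix_tokens * (write_multiplier + read_multiplier * (calls - 1))
    return uncached, cached


# A 10,000-token prefix reused across 100 calls.
uncached, cached = prefix_cost_units(10_000, calls=100)
savings = 1 - cached / uncached
print(f"prefix cost reduced by {savings:.1%}")
```

Under these assumptions a single call is more expensive with caching than without (the write surcharge is never recovered), so caching only helps for prefixes that are actually reused.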
3. Output length control
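Set max_tokens to the minimum needed, keeping in mind that it is a hard cap that truncates the response rather than making the model more concise. To shorten the output itself, use structured output (JSON schema) to prevent verbose filler text and instruct the model to be concise in the system prompt.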
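Because output tokens are generated one at a time, trimming the response has a roughly linear effect on latency. A back-of-the-envelope model is sketched below; the function name, the 0.5 s time to first token, and the 50 tokens per second decode rate are assumed numbers for illustration, not measurements of any particular model.

```python
def estimated_latency_s(
    output_tokens: int,
    time_to_first_token_s: float,
    tokens_per_second: float,
) -> float:
    """Rough end-to-end latency: prefill time plus sequential decoding time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token_s + output_tokens / tokens_per_second


# Assumed serving characteristics (illustrative only).
TTFT_S = 0.5
DECODE_RATE = 50.0

verbose = estimated_latency_s(800, TTFT_S, DECODE_RATE)
concise = estimated_latency_s(200, TTFT_S, DECODE_RATE)
print(f"verbose: {verbose:.1f} s, concise: {concise:.1f} s")
```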
4. Streaming
Streaming does not reduce total tokens or total generation time, but it improves perceived latency by letting the UI render tokens as they arrive, which is critical for chat interfaces.
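The difference between time to first token and total time can be illustrated with a simulated stream. The generator below is a stand-in for a streaming API response, not a real client; the token list and the 10 ms per-token delay are arbitrary.

```python
import time
from typing import Iterator, Optional, Sequence


def fake_token_stream(tokens: Sequence[str], delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming response: yields one token after each delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def measure(tokens: Sequence[str], delay_s: float) -> tuple[Optional[float], float]:
    """Return (time to first token, total time) when consuming the stream."""
    start = time.perf_counter()
    first: Optional[float] = None
    for _token in fake_token_stream(tokens, delay_s):
        if first is None:
            first = time.perf_counter() - start
        # A real UI would render the token here instead of discarding it.
    total = time.perf_counter() - start
    return first, total


ttft, total = measure(["Streaming", " shows", " text", " early", "."], delay_s=0.01)
print(f"first token after {ttft:.3f} s, full response after {total:.3f} s")
```

Without streaming the user waits for the full response time before seeing anything; with streaming the wait before the first visible text is only the time to first token.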
5. Batching offline workloads
Anthropic’s Message Batches API and OpenAI’s Batch API process requests asynchronously at roughly half the standard price, which suits non-real-time workloads such as document processing pipelines where results can wait minutes to hours.
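A sketch of preparing such a batch with the Anthropic Python SDK. The helper names, the summarization prompt, and the sample documents are hypothetical; the entry shape (`custom_id` plus `params`) and the `client.messages.batches.create` call reflect the SDK as I understand it, so check them against the version you have installed.

```python
from typing import Any


def build_batch_requests(
    documents: list[str], model: str, max_tokens: int = 256
) -> list[dict[str, Any]]:
    """Build one batch entry per document; custom_id lets results be matched back later."""
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Summarize this document:\n\n{doc}",
                    }
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]


def submit_batch(requests: list[dict[str, Any]]) -> Any:
    """Submit the batch; needs the `anthropic` package and an API key in the environment."""
    import anthropic  # imported lazily so the builder works without the SDK

    client = anthropic.Anthropic()
    return client.messages.batches.create(requests=requests)


batch_requests = build_batch_requests(
    ["First report text", "Second report text"], model="claude-opus-4-5"
)
print(len(batch_requests), "requests prepared")
```

Results arrive asynchronously, so the pipeline has to poll for completion and join outputs back to inputs by `custom_id` rather than by position.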
6. Quantized local models
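For high-volume, latency-sensitive workloads that do not need frontier quality, a 4-bit quantized 70B model on two A100s can approach GPT-4o-mini quality on many tasks at a fraction of the API cost. The savings only materialize at scale, when the GPUs stay highly utilized, so benchmark both quality and throughput on your own workload first.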
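The hardware requirement follows from simple arithmetic on weight storage. The sketch below counts weights only; the function name is illustrative, and the estimate deliberately ignores the KV cache, activations, and quantization metadata, all of which add to the real footprint.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(70e9, 16)  # 16-bit weights
int4_gb = weight_memory_gb(70e9, 4)   # 4-bit quantized weights
print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

Quantizing from 16 to 4 bits cuts the weight footprint by a factor of four, which is what brings a 70B model within reach of a small number of GPUs while leaving room for the KV cache.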