NLP & LLMs | Medium | Asked at OpenAI, Anthropic, Google, Databricks

What techniques reduce LLM cost and latency in production?

For: AI / LLM Engineer, ML Engineer, Data Engineer

The short answer

Cost scales with input plus output tokens; latency scales mainly with output tokens and model size. The highest-leverage techniques are model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Combined, these can cut spend by roughly 60–90% on suitable workloads; confirm with evals that quality does not regress.

How to think about it

Token economics

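Every API call costs input_tokens x input_price + output_tokens x output_price, and output tokens are usually priced several times higher than input tokens. Output tokens also dominate latency because they are generated sequentially. Input tokens are processed in parallel, so a long system prompt adds comparatively little latency (mostly to time-to-first-token), but you pay for it on every call unless it is cached.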
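As a rough illustration of the arithmetic above, the sketch below prices a single call with separate input and output rates, which is how providers generally quote them. The helper name `estimate_cost_usd` is hypothetical, and the prices are placeholder assumptions rather than any provider's actual rates.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Cost of one call, with prices quoted in USD per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Assumed prices for illustration only: $3 per 1M input, $15 per 1M output.
cost = estimate_cost_usd(input_tokens=4_000, output_tokens=500,
                         input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"${cost:.4f} per call")
```

With these assumed rates, the 500 output tokens account for more than a third of the bill despite being an eighth of the input length, which is why trimming output is often worth more than trimming the prompt.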

Top techniques

1. Model routing / cascading

Classify queries by complexity, route simple ones to a smaller model (e.g., GPT-4o-mini or Claude Haiku), and escalate to a frontier model only when needed. Once the query embedding is available, a binary classifier on top of it can make this decision in a few milliseconds, so the routing overhead is dominated by computing the embedding itself.
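A minimal sketch of such a router is shown below, assuming the query embedding has already been computed elsewhere. The model names, weights, bias, and threshold are all hypothetical placeholders; in practice the weights would come from a classifier trained on labeled examples of simple and complex queries.

```python
import math
from typing import Sequence

# Hypothetical model identifiers; substitute whatever your provider offers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"


def complexity_score(embedding: Sequence[float],
                     weights: Sequence[float], bias: float) -> float:
    """Logistic-regression probability that a query is 'complex'."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def route(embedding: Sequence[float], weights: Sequence[float],
          bias: float, threshold: float = 0.5) -> str:
    """Send likely-complex queries to the frontier model, the rest to the small one."""
    score = complexity_score(embedding, weights, bias)
    return FRONTIER_MODEL if score >= threshold else SMALL_MODEL


# Toy 3-dimensional embeddings and hand-picked weights, for illustration only.
weights, bias = [2.0, -1.0, 0.5], -0.5
simple_query = [0.1, 0.9, 0.0]
complex_query = [0.9, 0.1, 0.8]

print(route(simple_query, weights, bias))
print(route(complex_query, weights, bias))
```

Raising the threshold sends more traffic to the small model and saves more money, at the risk of under-serving hard queries, so it is worth tuning against a quality eval rather than picking it by hand.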

2. Prompt caching

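Anthropic and OpenAI both support prefix caching: Anthropic through explicit cache_control markers, OpenAI automatically for sufficiently long prompts. Keep the static portion of your prompt (system prompt + few-shot examples + any documents shared across requests) at the front, and the dynamic portion (user message) at the end, because any change inside the prefix invalidates the cache from that point on. Cached input tokens are billed at a steep discount: roughly 10% of the normal input price for cache reads on Anthropic (cache writes carry a surcharge), while OpenAI's discount varies by model.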

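# Anthropic prompt caching: mark the expensive, static prefix
import anthropic

# Placeholders: supply your own text. The real prefix must exceed the
# model's minimum cacheable length, or it will not be cached.
very_long_system_prompt = "..."  # static instructions, few-shot examples, shared docs
user_query = "..."               # the dynamic part, kept at the end

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": very_long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
# A non-zero value means this call read the prefix from the cache
print(response.usage.cache_read_input_tokens)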
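To see how much a cached prefix can save, the sketch below compares the prefix cost with and without caching over a series of calls. The multipliers (1.25x for the initial cache write, 0.10x for each cache read) are assumptions modeled on Anthropic-style pricing; check current provider pricing before relying on them. The helper `prefix_cost_units` is hypothetical, and it assumes every call after the first hits the cache.

```python
def prefix_cost_units(prefix_tokens: int, calls: int,
                      write_mult: float = 1.25, read_mult: float = 0.10):
    """Return (uncached, cached) cost of the shared prefix over `calls` requests,
    measured in units of 'one input token at the normal price'.

    Assumes the first call writes the cache and every later call reads it.
    """
    uncached = prefix_tokens * calls
    cached = prefix_tokens * (write_mult + read_mult * (calls - 1))
    return uncached, cached


uncached, cached = prefix_cost_units(prefix_tokens=10_000, calls=100)
savings = 1 - cached / uncached
print(f"prefix cost reduced by {savings:.0%}")
```

Under these assumptions the cache pays for itself from the second call onward, since one write plus one read (1.35 units) already costs less than two uncached passes (2 units). Real savings are lower when calls are spaced far enough apart for the cache entry to expire.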

3. Output length control

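Set max_tokens to the smallest value that fits the expected response. Keep in mind that max_tokens is a hard cap: it truncates the output rather than making the model more concise, so pair it with an explicit instruction to be concise in the system prompt. Use structured output (JSON schema) to prevent verbose filler text.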
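One way to apply this is to keep a per-task output budget and to detect when a response was cut off by the cap. The sketch below is illustrative: the task names and budgets are hypothetical, and the truncation check assumes an Anthropic-style response whose `stop_reason` is `"max_tokens"` when the cap is hit. Stand-in objects replace real API responses so the example runs offline.

```python
from types import SimpleNamespace

# Hypothetical per-task output budgets (in tokens); tune them from observed output lengths.
MAX_TOKENS_BY_TASK = {"classify": 16, "extract_json": 256, "summarize": 400}


def output_budget(task: str, default: int = 512) -> int:
    """Look up the max_tokens cap for a task type, falling back to a default."""
    return MAX_TOKENS_BY_TASK.get(task, default)


def was_truncated(response) -> bool:
    """True if generation stopped because it hit the max_tokens cap.

    Assumes an Anthropic-style response object with a stop_reason attribute.
    """
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-in response objects so the sketch runs without an API call.
cut_off = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(output_budget("classify"), was_truncated(cut_off), was_truncated(finished))
```

Tracking the truncation rate per task shows whether a budget is too tight: a cap that frequently cuts off valid answers costs more in retries than it saves.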

4. Streaming

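Streaming does not reduce total tokens or total generation time, but it improves perceived latency by letting the UI render tokens as they arrive, so users wait only for the first token rather than for the full response. This is critical for chat interfaces.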
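The difference between time-to-first-token and total response time can be measured with a small wrapper around any token stream. In the sketch below, `fake_token_stream` is a hypothetical stand-in for a provider's streaming response, and the delay is an arbitrary assumption.

```python
import time
from typing import Iterable, Iterator, Sequence


def fake_token_stream(tokens: Sequence[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming API response: yields one token per delay."""
    for tok in tokens:
        time.sleep(delay_s)
        yield tok


def consume_stream(stream: Iterable[str]):
    """Return (text, time_to_first_token_s, total_time_s) for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for tok in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(tok)  # a real UI would render the token here
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total


text, ttft, total = consume_stream(fake_token_stream(["Hel", "lo", ", ", "world"]))
print(f"first token after {ttft:.3f}s, full response after {total:.3f}s")
```

Logging both numbers in production makes it clear which one users actually experience: with streaming it is the first, without streaming it is the second.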

5. Batching offline workloads

Anthropic’s Message Batches API and OpenAI’s Batch API process requests asynchronously at roughly a 50% discount, typically returning results within 24 hours. They suit non-real-time workloads such as document processing pipelines.
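As a sketch of what preparing a batch might look like, the helper below builds one request entry per document in the shape used by Anthropic's Message Batches API (a `custom_id` plus the usual message `params`). The helper name, the prompt wording, and the `doc-{i}` ID scheme are assumptions for illustration; the submission call is shown only as a comment so the example runs without network access.

```python
def build_batch_requests(documents, model: str, max_tokens: int = 256):
    """Build one batch entry per document, each with a unique custom_id."""
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Summarize this document:\n\n{doc}"}
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]


requests = build_batch_requests(["first doc", "second doc"], model="claude-opus-4-5")
# Submission would look roughly like this with the Anthropic Python SDK:
#   batch = anthropic.Anthropic().messages.batches.create(requests=requests)
# Results arrive asynchronously and are matched back to inputs via custom_id.
print(len(requests), requests[0]["custom_id"])
```

Because results may come back in any order, the `custom_id` is the only reliable link between an output and its source document, so it should map to a stable key in your own pipeline.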

6. Quantized local models

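For high-volume, latency-sensitive workloads that do not need frontier quality, a self-hosted 4-bit quantized 70B model on two A100s can be competitive with a small hosted model such as GPT-4o-mini on many tasks, at a fraction of the API cost at scale. The savings materialize only when GPU utilization is high enough to amortize hardware and operations costs, so validate both quality and throughput on your own workload.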
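The memory arithmetic behind this is straightforward. The sketch below estimates weight memory only; it ignores the KV cache, activations, and the small per-group overhead that real quantization formats add, so treat the results as lower bounds. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(70e9, 16)  # 16-bit baseline
int4_gb = weight_memory_gb(70e9, 4)   # 4-bit quantized
print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

At 4 bits the weights of a 70B model take about 35 GB instead of about 140 GB at 16 bits, which is what makes a two-GPU deployment plausible with room left over for the KV cache.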
