What is a KV cache and how does it speed up LLM inference?

During autoregressive generation, attention needs the Keys and Values of every previous token at each step. Without a cache these are recomputed every time; the KV cache stores the K and V tensors so each new token computes only its own, which cuts the per-step cost from quadratic to linear in sequence length. The tradeoff is memory that grows in proportion to sequence length and batch size.
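To make the memory tradeoff concrete, here is a rough sizing sketch. It assumes the common layout of one K and one V tensor per layer, and the model dimensions used (32 layers, 32 KV heads, head size 128, 2-byte values) are illustrative assumptions rather than figures from the text above.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: two tensors (K and V) per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Hypothetical model config, chosen only for illustration.
size_one = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
size_batch = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(size_one / 2**30, "GiB for one sequence")
print(size_batch / 2**30, "GiB for a batch of 8")
```

Doubling either the sequence length or the batch size doubles the result, which is the linear memory growth described above.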

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage techniques are model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Combined, they can cut spend by 60–90% without a quality regression, depending on the workload.
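The cost side of this can be sketched as simple arithmetic. The prices below are placeholders invented for the example (not quotes from any provider), and `blended_cost` is a hypothetical helper that models routing as a fixed share of traffic going to the smaller model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical prices in dollars per million tokens."""
    input_per_mtok: float
    output_per_mtok: float

def request_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output tokens are billed separately."""
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000

# Placeholder prices, not real quotes from any provider.
LARGE = ModelPrice(input_per_mtok=10.0, output_per_mtok=30.0)
SMALL = ModelPrice(input_per_mtok=1.0, output_per_mtok=3.0)

def blended_cost(small_share: float, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when `small_share` of traffic goes to SMALL."""
    small = request_cost(SMALL, input_tokens, output_tokens)
    large = request_cost(LARGE, input_tokens, output_tokens)
    return small_share * small + (1 - small_share) * large

baseline = blended_cost(0.0, input_tokens=2000, output_tokens=500)
routed = blended_cost(0.7, input_tokens=2000, output_tokens=500)
print(f"saving from routing: {1 - routed / baseline:.0%}")
```

The same function shows the effect of output length control: lowering `output_tokens` reduces the cost of every request regardless of which model serves it.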

When the KV cache doesn't fit in GPU VRAM, what are your options?

The KV cache is working memory — it's re-read to generate every token — so it has to stay fast. When VRAM fills, you offload the least-active sessions down a memory hierarchy: GPU VRAM (active, ~3 TB/s), CPU RAM over PCIe (idle, ~50 GB/s), local SSD (long-idle), and networked storage (cold/durable only, never live decode). Idle sessions are parked lower and promoted back to VRAM on activity. The alternative is to drop the cache and recompute the prefill when the session returns; for long prompts, offloading and reloading usually beats recomputing attention over thousands of tokens.
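The offload-versus-recompute choice comes down to comparing two times. The sketch below uses the ~50 GB/s CPU RAM figure quoted above; the prefill throughput and per-token KV footprint are hypothetical values that depend on the model and hardware.

```python
def reload_seconds(cache_bytes: int, link_bytes_per_s: float) -> float:
    """Time to copy a parked KV cache back to the GPU over a given link."""
    return cache_bytes / link_bytes_per_s

def recompute_seconds(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the cache by re-running prefill on the prompt."""
    return prompt_tokens / prefill_tokens_per_s

PCIE_BYTES_PER_S = 50e9          # ~50 GB/s, the CPU RAM figure quoted above
PREFILL_TOKENS_PER_S = 5_000.0   # hypothetical; depends on model and GPU
BYTES_PER_TOKEN = 512 * 1024     # hypothetical KV footprint per token

prompt_tokens = 8_000
reload_t = reload_seconds(prompt_tokens * BYTES_PER_TOKEN, PCIE_BYTES_PER_S)
recompute_t = recompute_seconds(prompt_tokens, PREFILL_TOKENS_PER_S)
choice = "reload" if reload_t < recompute_t else "recompute"
print(f"reload {reload_t:.3f}s vs recompute {recompute_t:.3f}s -> {choice}")
```

With these assumed numbers reloading wins comfortably; a slower tier such as networked storage, or a very short prompt, can flip the result.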

What is the KV cache in a transformer and why does it matter for inference?

The KV cache stores the key and value tensors computed during previous forward passes so they do not need to be recomputed for every new token during autoregressive generation. Without it, generating each token would require a full forward pass over the entire context from scratch, making inference cost grow quadratically with sequence length rather than linearly.
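A simple count illustrates the difference. This sketch tallies how many per-token K/V computations happen over a whole generation, with and without a cache; it deliberately ignores the attention dot products themselves and treats every token alike.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Each step recomputes K and V for every token seen so far."""
    return sum(step for step in range(1, num_tokens + 1))

def kv_projections_with_cache(num_tokens: int) -> int:
    """Each step computes K and V only for the newest token."""
    return num_tokens

n = 1000
without = kv_projections_without_cache(n)
with_cache = kv_projections_with_cache(n)
print(without, "vs", with_cache)
```

The uncached count grows as n(n+1)/2, while the cached count grows as n.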

Caching: exact, semantic & prompt — Generative AI

The basic idea

A cache sits between your application and the expensive operation (an LLM call, an embedding API call). On a hit, you return a stored result instantly. On a miss, you do the real work, store the result, and return it. The art is in choosing what to use as the key.

The basic cache loop: a hit returns the stored result instantly; a miss calls the real API, pays the cost, then stores the result for next time.

The critical variable is hit rate. A cache with a 10% hit rate barely moves the needle. A cache with a 90% hit rate cuts the bill for the cached operation by roughly 90%. Everything that follows is in service of achieving the highest hit rate for a given level of correctness.
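The relationship between hit rate and spend is linear, as this small sketch shows. The request volume and per-call price are made-up numbers, and the model ignores the cost of running the cache itself.

```python
def expected_spend(num_requests: int, cost_per_call: float, hit_rate: float) -> float:
    """Spend on the cached operation when only misses pay for the real call."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return num_requests * (1.0 - hit_rate) * cost_per_call

# Hypothetical workload: 100,000 requests at $0.01 per uncached call.
no_cache = expected_spend(100_000, 0.01, hit_rate=0.0)
low_hit = expected_spend(100_000, 0.01, hit_rate=0.10)
high_hit = expected_spend(100_000, 0.01, hit_rate=0.90)
print(no_cache, low_hit, high_hit)
```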

Cache 1: Exact-match

Key: hash(model + params + exact input text)

Value: the stored response

Lookup: O(1) — a single hash lookup in Redis or an in-memory dictionary.

If the input bytes are identical, the key is identical, and you get a hit. That’s it. No vectors, no machine learning.

Why embeddings are the perfect candidate

An embedding model is deterministic: the same text plus the same model always produces the same vector. There is no temperature, no randomness. Once you’ve computed embed("What is your refund policy?") with text-embedding-3-small, that exact 1536-dimensional vector will never change — unless the model version changes.

This means you can cache embeddings indefinitely (until the model is updated). In a RAG system that embeds thousands of documents, or in a chatbot that embeds the same query variants repeatedly, this collapses re-computation to a hashtable lookup.

import hashlib, json

def make_embed_key(text: str, model: str) -> str:
    payload = json.dumps({"text": text, "model": model}, sort_keys=True)
    return "embed:" + hashlib.sha256(payload.encode()).hexdigest()

# On a cache miss you call the API; on a hit you skip it entirely.
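A minimal sketch of the full loop around that key function might look like this. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` stands in for a real embedding API call, and the in-memory dictionary could be swapped for Redis or another store.

```python
import hashlib
import json
from typing import Callable

def make_embed_key(text: str, model: str) -> str:
    payload = json.dumps({"text": text, "model": model}, sort_keys=True)
    return "embed:" + hashlib.sha256(payload.encode()).hexdigest()

class EmbeddingCache:
    """Exact-match cache in front of an embedding function."""

    def __init__(self, embed_fn: Callable[[str, str], list[float]]):
        self._embed_fn = embed_fn              # the expensive call
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str, model: str) -> list[float]:
        key = make_embed_key(text, model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text, model)   # pay the cost once
        self._store[key] = vector
        return vector

def fake_embed(text: str, model: str) -> list[float]:
    """Stand-in for a real embedding API call."""
    return [float(len(text)), float(len(model))]

cache = EmbeddingCache(fake_embed)
cache.get("What is your refund policy?", "text-embedding-3-small")
cache.get("What is your refund policy?", "text-embedding-3-small")
cache.get("What is your refund policy? ", "text-embedding-3-small")  # trailing space
print(cache.hits, cache.misses)
```

The second call is a hit; the third differs by one trailing space, so it produces a different key and misses.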

Exact-match also works well for LLM completions at temperature=0, where the same prompt gives the same (or very nearly the same) output every time. With temperature above 0 the output is sampled, so a cache hit returns the earlier response instead of a freshly sampled one; decide based on whether a fresh sample matters for your use case.

The limit: it is byte-sensitive. “Reset my password” and “reset my password ” (trailing space) are different keys. Paraphrases never hit.

Cache 2: Semantic cache

Key concept: match by meaning, not by exact bytes.

Instead of hashing the input, you embed it and store it in a vector index alongside its cached answer. On a new request, you embed the incoming query, search the index for the nearest stored query, and if the cosine similarity is above a threshold, return the stored answer.

Left: exact-match hashes the bytes — paraphrases always miss. Right: semantic cache embeds the query and does a vector search — the paraphrase hits because its cosine similarity to the stored query is above the threshold.

This catches paraphrases that exact-match misses. If “Reset password” and “how do I change my password” embed close together, a threshold such as cosine ≥ 0.95 lets one hit the other's entry, while a different question scores below the threshold and misses. The right threshold depends on the embedding model, so tune it against real traffic.

# Pseudocode — semantic cache lookup
def semantic_cache_get(query: str, model: str, threshold: float = 0.95):
    q_vec = embed(query, model)                   # 1 embedding call
    nearest, score = vector_index.search(q_vec, k=1)
    if score >= threshold:
        return cache_store.get(nearest.key)       # hit
    return None                                   # miss — caller must call LLM
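A runnable version of the same idea, as a sketch under stated assumptions: the vector index is a brute-force scan over an in-memory list, and `toy_embed` is a bag-of-words stand-in for a real embedding model. Because toy vectors are much coarser than real embeddings, the demo uses a threshold of 0.8 rather than 0.95.

```python
import math
from typing import Callable, Optional

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed_fn: Callable[[str], Vector], threshold: float = 0.95):
        self._embed_fn = embed_fn
        self._threshold = threshold
        self._entries: list[tuple[Vector, str]] = []   # (query vector, answer)

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed_fn(query), answer))

    def get(self, query: str) -> Optional[str]:
        if not self._entries:
            return None
        q_vec = self._embed_fn(query)
        # Brute-force nearest neighbour; a real vector index would replace this scan.
        score, answer = max((cosine(q_vec, vec), ans) for vec, ans in self._entries)
        return answer if score >= self._threshold else None

VOCAB = ["reset", "password", "refund", "policy", "my", "change"]

def toy_embed(text: str) -> Vector:
    """Toy bag-of-words vector, a stand-in for a real embedding model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.8)
cache.put("reset my password", "Use the 'Forgot password' link.")
paraphrase_hit = cache.get("reset password")   # similar enough: returns the answer
unrelated_miss = cache.get("refund policy")    # no overlap: returns None
```

Every lookup still costs one embedding call, which is why exact-match is usually checked first.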

Cache 3: Provider-side prompt caching

This one works differently. It lives inside the model provider (Anthropic, OpenAI, or an open-source inference server like vLLM) and caches not the final answer but the computed KV state of a long, stable prefix.

What is the KV cache?

When a transformer processes a sequence of tokens, it computes key-value attention tensors for each token. For a 10,000-token system prompt, that computation is expensive — and it is identical for every request that uses the same system prompt. Provider-side prompt caching stores those KV tensors so that subsequent calls with the same prefix skip recomputing them.

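You benefit from lower latency and lower input token cost. At the time of writing, Anthropic bills cache hits on the stable prefix at 10% of the normal input token price, and processing that portion adds near-zero latency. The first request that writes the prefix to the cache is billed at a premium, so the saving comes from repeated reuse.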
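As a rough worked example of the input-cost effect, the sketch below expresses billed input in full-price token equivalents. The 10,000-token prefix and the 10% ratio come from the text; the 200-token user message is an assumption, and the model ignores any extra charge a provider may apply when the prefix is first written to the cache.

```python
def billed_input_tokens(prefix_tokens: int, suffix_tokens: int,
                        cache_hit: bool, cached_price_ratio: float = 0.10) -> float:
    """Input tokens expressed in full-price equivalents."""
    prefix_cost = prefix_tokens * (cached_price_ratio if cache_hit else 1.0)
    return prefix_cost + suffix_tokens

# 10,000-token prefix as in the text; the 200-token user message is assumed.
miss = billed_input_tokens(10_000, 200, cache_hit=False)
hit = billed_input_tokens(10_000, 200, cache_hit=True)
saving = 1 - hit / miss
print(f"input cost saving on a hit: {saving:.0%}")
```

The longer the stable prefix is relative to the variable suffix, the closer the saving gets to the full 90%.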

How to exploit it

Structure your prompt so the stable part comes first and the variable part comes last:

[system prompt]          <-- stable, maybe 1,000 tokens
[shared document/tools]  <-- stable, maybe 8,000 tokens
[few-shot examples]      <-- stable, maybe 500 tokens
[user message]           <-- variable, changes every call

The provider caches KV for everything up to the last stable token. The user message is always recomputed — but it is small.

The rule: keep the prefix byte-stable. Any change to a single character in the prefix invalidates the cache for everything after that point. Do not inject timestamps, request IDs, or user-specific data into the prefix — put those in the suffix.

Prompt caching: a long, byte-stable prefix gets its KV tensors cached by the provider. Only the small variable suffix is recomputed on each call. Put the stable material first, user-specific content last.

Unlike exact-match or semantic caching, prompt caching does not return a cached answer. It returns a cached intermediate computation. The model still generates a fresh response from the cached prefix + new suffix — so the answer is always correct for the new input.
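One way to keep the prefix byte-stable is to assemble it in a single place and log a hash of it. This is a sketch only: `build_prompt` and `prefix_fingerprint` are hypothetical helpers for local diagnostics, not part of any provider's API, and the prompt strings are placeholders.

```python
import hashlib

def build_prompt(system_prompt: str, shared_docs: str, few_shot: str,
                 user_message: str) -> tuple[str, str]:
    """Return (stable_prefix, full_prompt) with the variable part last."""
    stable_prefix = "\n\n".join([system_prompt, shared_docs, few_shot])
    return stable_prefix, stable_prefix + "\n\n" + user_message

def prefix_fingerprint(stable_prefix: str) -> str:
    """Hash of the prefix bytes; log it to spot accidental prefix changes."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()

SYSTEM = "You are a support assistant."
DOCS = "Refund policy: 30 days."
EXAMPLES = "Q: Hi\nA: Hello!"

prefix_a, prompt_a = build_prompt(SYSTEM, DOCS, EXAMPLES, "Where is my order?")
prefix_b, prompt_b = build_prompt(SYSTEM, DOCS, EXAMPLES, "Cancel my plan.")
# Injecting a timestamp into the system prompt changes the prefix bytes.
prefix_c, _ = build_prompt(SYSTEM + " Now: 12:00", DOCS, EXAMPLES, "Cancel my plan.")

same_prefix = prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b)
broken_prefix = prefix_fingerprint(prefix_a) != prefix_fingerprint(prefix_c)
```

If the fingerprint changes between requests that should share a prefix, something variable has leaked into the stable part.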

Cache invalidation (the hard part)

“There are only two hard things in computer science: cache invalidation and naming things.” — Phil Karlton

The three failure modes to prevent:

1. Stale model responses

If you upgrade the embedding model from text-embedding-ada-002 to text-embedding-3-large, the old cached embeddings are wrong — they live in a different vector space. Always include the model name and version in the cache key. A simple version tag like v=2 in the key buys you a clean namespace when you upgrade.

key = f"embed:{model_name}:{model_version}:{hashlib.sha256(text.encode()).hexdigest()}"

2. Time-sensitive data

Cache entries should carry a TTL (time-to-live) proportional to how fast the underlying data changes. A “What is your refund policy?” answer can live for a week; a “What is the current price of X?” answer should live for seconds or not at all. Design your TTL by asking: if I served this answer 24 hours from now, would it be wrong?
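A minimal in-memory sketch of per-entry TTLs follows. `TTLCache` is a hypothetical class written for this example; the clock is injectable so expiry can be demonstrated without sleeping, and the keys and values are placeholders.

```python
import time
from typing import Any, Callable, Optional

class TTLCache:
    """In-memory cache whose entries expire after a per-entry TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}   # key -> (expires_at, value)

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]          # expired: treat as a miss
            return None
        return value

# A fake clock makes expiry easy to demonstrate without sleeping.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("refund-policy", "30 days", ttl_seconds=7 * 24 * 3600)
cache.set("price-of-x", "$19.99", ttl_seconds=5)

now[0] = 60.0                               # one minute later
policy_after_minute = cache.get("refund-policy")
price_after_minute = cache.get("price-of-x")
```

After one simulated minute the week-long policy entry is still served, while the five-second price entry has expired and reads as a miss.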

3. Cross-user cache leakage

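If the key is only a hash of the prompt, two users who send the same prompt share one cache entry, so an answer built from one tenant's private data can be served to another. A robust key design scopes each entry to its tenant and user: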

# For a user-scoped LLM response
key = f"llm:{tenant_id}:{user_id}:{model}:{version}:{hashlib.sha256(prompt.encode()).hexdigest()}"

For embeddings that are not user-specific (e.g. embedding a shared knowledge base document), you can omit the user ID — the embedding does not depend on who is asking.
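Both key shapes can be wrapped in small helpers, sketched here. The function names, tenant names and model identifiers are hypothetical; the point is only that the same prompt from two tenants must never map to the same key.

```python
import hashlib

def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def llm_cache_key(tenant_id: str, user_id: str, model: str,
                  version: str, prompt: str) -> str:
    """Key for a user-scoped LLM response."""
    return f"llm:{tenant_id}:{user_id}:{model}:{version}:{_digest(prompt)}"

def embed_cache_key(model: str, version: str, text: str) -> str:
    """Key for a shared embedding; no tenant or user scope needed."""
    return f"embed:{model}:{version}:{_digest(text)}"

prompt = "Summarize our open invoices."
key_a = llm_cache_key("acme", "u1", "model-x", "v2", prompt)
key_b = llm_cache_key("globex", "u1", "model-x", "v2", prompt)
shared = embed_cache_key("embed-model", "v2", "Refund policy: 30 days.")
```

`key_a` and `key_b` share the same prompt digest but differ in the tenant segment, so neither tenant can read the other's cached answer.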

The three caches at a glance

	Exact-match	Semantic	Prompt (provider)
What’s cached	Full response	Full response	KV tensors of prefix
Key type	Hash of input bytes	Nearest vector in index	Exact byte prefix
Hit condition	Identical input	Cosine above threshold	Identical prefix
Latency on hit	~1 ms	~5–20 ms (vector search)	Lower time to first token
Cost on hit	~$0	~$0 + 1 embed call	~10% of normal price for prefix tokens
Best for	Embeddings, temp=0 LLM	Paraphrase-heavy Q&A	Long shared system prompts
Main risk	Low hit rate for varied input	False hit if threshold too low	Prefix must stay byte-stable

Putting it together

For the “same queries over and over” scenario, the playbook is:

Add exact-match caching first. It is O(1), has zero false-hit risk, and in a chatbot with repeated queries the hit rate is often 60–80% with no tuning.
Layer semantic caching on top for the paraphrase tail. Start with a high threshold (0.95+) and lower it cautiously while monitoring for false hits.
Enable provider prompt caching for your long system prompt and tool definitions. This is nearly free to adopt — just keep the prefix stable.
Include the model name and version in your cache keys. Set TTLs. Namespace by tenant and user for any personalized response.
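The layering in this playbook can be sketched as a single lookup function that tries the cheapest layer first. All names here are hypothetical, the cache layers are passed in as plain callables, and the raw query is used as the key only for brevity; a real key would include the model, version and tenant scope.

```python
from typing import Callable, Optional

def answer(query: str,
           exact_get: Callable[[str], Optional[str]],
           semantic_get: Callable[[str], Optional[str]],
           call_llm: Callable[[str], str],
           store: Callable[[str, str], None]) -> tuple[str, str]:
    """Return (response, source), trying the cheapest layer first."""
    cached = exact_get(query)
    if cached is not None:
        return cached, "exact"
    cached = semantic_get(query)
    if cached is not None:
        return cached, "semantic"
    response = call_llm(query)      # provider prompt caching applies inside this call
    store(query, response)
    return response, "llm"

# Minimal stand-ins so the flow can be exercised end to end.
exact_store: dict[str, str] = {}
llm_calls: list[str] = []

def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"

first = answer("reset my password", exact_store.get, lambda q: None,
               fake_llm, exact_store.__setitem__)
second = answer("reset my password", exact_store.get, lambda q: None,
                fake_llm, exact_store.__setitem__)
```

The first call falls through to the model and stores the result; the second is served from the exact-match layer without another model call.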

Quick check


Q1. An embedding model always returns the same vector for the same input text and model. What does this imply for caching?

Q2. You lower the semantic cache similarity threshold from 0.95 to 0.80 to raise your hit rate. What is the danger?

Q3. Your multi-tenant SaaS product serves 200 enterprise customers. Each customer's data is isolated. You add an LLM-powered Q&A feature and cache LLM responses by hashing the prompt. What is the critical mistake?
