Caching: exact, semantic & prompt
Your embedding API keeps answering the same question. Three kinds of cache that cut LLM cost and latency — and the one that can silently return wrong answers.
What you'll learn
- Exact-match caching and why embeddings are the perfect candidate
- Semantic caching: matching by meaning with a vector search + threshold
- Provider-side prompt caching: reusing the KV of a stable prefix
- Cache invalidation: TTLs, versioning by model+params, and never leaking across users
- When semantic caching returns a confidently wrong answer
The basic idea
A cache sits between your application and the expensive operation (an LLM call, an embedding API call). On a hit, you return a stored result instantly. On a miss, you do the real work, store the result, and return it. The art is in choosing what to use as the key.
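The hit/miss flow described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dictionary as the store; `get_or_compute` and `fake_llm` are hypothetical names, and `fake_llm` stands in for whatever expensive call you are caching.

```python
from typing import Callable, Dict


def get_or_compute(cache: Dict[str, str], key: str, compute: Callable[[], str]) -> str:
    """Return the cached value for key, computing and storing it on a miss."""
    if key in cache:       # hit: skip the expensive call entirely
        return cache[key]
    value = compute()      # miss: do the real work
    cache[key] = value     # store the result for the next request
    return value


calls = []


def fake_llm() -> str:
    # Hypothetical stand-in for an expensive LLM or embedding call.
    calls.append(1)
    return "stored answer"


store: Dict[str, str] = {}
first = get_or_compute(store, "q1", fake_llm)
second = get_or_compute(store, "q1", fake_llm)  # served from the cache
```

The second call never reaches `fake_llm`; everything else in this article is about choosing `key` so that this happens as often as it safely can.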
The critical variable is hit rate. A cache with a 10% hit rate barely moves the needle; a cache with a 90% hit rate removes roughly 90% of the calls to the cached operation, and most of their cost. Everything that follows is in service of achieving the highest hit rate for a given level of correctness.
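To make the hit-rate arithmetic concrete, here is a small sketch of expected cost per request. The function name and the default of zero cost on a hit are assumptions for illustration; a real hit still costs a lookup and, for semantic caching, an embedding call.

```python
def expected_cost_per_request(hit_rate: float, miss_cost: float, hit_cost: float = 0.0) -> float:
    """Average cost per request given the fraction of requests served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


# With a miss costing 1 unit, compare a 10% and a 90% hit rate.
low = expected_cost_per_request(0.10, miss_cost=1.0)
high = expected_cost_per_request(0.90, miss_cost=1.0)
```

Under these assumptions the 10% cache still pays about 0.9 units per request, while the 90% cache pays about 0.1.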
Cache 1: Exact-match
Key: hash(model + params + exact input text)
Value: the stored response
Lookup: O(1) — a single hash lookup in Redis or an in-memory dictionary.
If the input bytes are identical, the key is identical, and you get a hit. That’s it. No vectors, no machine learning.
Why embeddings are the perfect candidate
An embedding model is effectively deterministic: the same text with the same model and parameters produces the same vector, up to negligible floating-point noise on some hosted APIs. There is no temperature and no sampling. Once you’ve computed embed("What is your refund policy?") with text-embedding-3-small, that 1536-dimensional vector stays valid until the model version changes.
This means you can cache embeddings indefinitely (until the model is updated). In a RAG system that embeds thousands of documents, or in a chatbot that embeds the same query variants repeatedly, this collapses re-computation to a hashtable lookup.
import hashlib
import json

def make_embed_key(text: str, model: str) -> str:
    payload = json.dumps({"text": text, "model": model}, sort_keys=True)
    return "embed:" + hashlib.sha256(payload.encode()).hexdigest()

# On a cache miss you call the API; on a hit you skip it entirely.
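One way to use such a key is to wrap the embedding call in a small cache object. The sketch below assumes an in-memory dictionary; `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for a real embedding API call.

```python
import hashlib
import json
from typing import Callable, Dict, List


def make_embed_key(text: str, model: str) -> str:
    payload = json.dumps({"text": text, "model": model}, sort_keys=True)
    return "embed:" + hashlib.sha256(payload.encode()).hexdigest()


class EmbeddingCache:
    """Exact-match cache in front of an embedding function."""

    def __init__(self, embed_fn: Callable[[str, str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str, model: str) -> List[float]:
        key = make_embed_key(text, model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text, model)  # the only place the API is called
        self._store[key] = vector
        return vector


def fake_embed(text: str, model: str) -> List[float]:
    # Hypothetical stand-in for a real embedding API call.
    return [float(len(text)), float(len(model))]


cache = EmbeddingCache(fake_embed)
v1 = cache.get("What is your refund policy?", "text-embedding-3-small")
v2 = cache.get("What is your refund policy?", "text-embedding-3-small")
```

Tracking `hits` and `misses` on the object gives you the hit rate directly, which is the number worth watching before tuning anything else.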
Exact-match also works well for LLM completions at temperature=0, where the same prompt yields the same output in nearly every case (hosted models can still vary slightly between runs). For anything with temperature above 0 the output is sampled, so a cache hit replays one earlier sample instead of drawing a fresh one; cache only if that is acceptable for your use case.
The limit: it is byte-sensitive. “Reset my password” and “reset my password ” (trailing space) are different keys. Paraphrases never hit.
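A common mitigation for the whitespace and casing part of this problem, not described in the original text, is to normalize the input before hashing it. The sketch below is one possible normalization; the function names are hypothetical, and it is only safe where case and spacing do not change the meaning of the input (it would be wrong for code snippets, for example).

```python
import hashlib
import unicodedata


def normalize_for_key(text: str) -> str:
    """Collapse differences that should not change the answer."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = " ".join(text.split())               # trim and collapse whitespace
    return text.casefold()                      # ignore letter case


def exact_key(text: str, model: str) -> str:
    digest = hashlib.sha256(normalize_for_key(text).encode("utf-8")).hexdigest()
    return f"llm:{model}:{digest}"


same_a = exact_key("Reset my password", "some-model")
same_b = exact_key("reset my password ", "some-model")
paraphrase = exact_key("how do I change my password", "some-model")
```

The two spellings now share a key, while the paraphrase still produces a different key and misses. Send the original text to the model on a miss; the normalized form is only for the key.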
Cache 2: Semantic cache
Key concept: match by meaning, not by exact bytes.
Instead of hashing the input, you embed it and store it in a vector index alongside its cached answer. On a new request, you embed the incoming query, search the index for the nearest stored query, and if the cosine similarity is above a threshold, return the stored answer.
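This catches paraphrases that exact-match misses. “Reset password” and “how do I change my password” embed close together. The goal is a threshold, for example cosine ≥ 0.95, that is loose enough for such paraphrases to hit each other and tight enough that a different question stays below it and misses. Actual similarity scores depend on the embedding model, so tune the threshold on real query pairs. Set it too low and a differently intended question that happens to be worded similarly gets the stored answer back, with nothing to signal that it is wrong.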
# Pseudocode: semantic cache lookup
def semantic_cache_get(query: str, model: str, threshold: float = 0.95):
    q_vec = embed(query, model)                       # 1 embedding call
    nearest, score = vector_index.search(q_vec, k=1)
    if score >= threshold:
        return cache_store.get(nearest.key)           # hit
    return None                                       # miss: caller must call the LLM
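A runnable version of the same lookup might look like the sketch below. It is an illustration under stated assumptions: a linear scan replaces the vector index, `SemanticCache` is a hypothetical class, and `TOY_VECTORS` are hand-written vectors standing in for a real embedding model, so the similarity values say nothing about how a real model scores these phrases.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Stores (query vector, answer) pairs and returns the nearest answer above a threshold."""

    def __init__(self, embed_fn: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self._embed_fn = embed_fn
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed_fn(query), answer))

    def get(self, query: str) -> Optional[str]:
        if not self._entries:
            return None
        q_vec = self._embed_fn(query)
        # Linear scan; a real system would query a vector index instead.
        best_vec, best_answer = max(self._entries, key=lambda e: cosine(q_vec, e[0]))
        if cosine(q_vec, best_vec) >= self._threshold:
            return best_answer  # hit
        return None             # miss: caller must call the LLM


# Toy vectors, invented for this example only.
TOY_VECTORS = {
    "reset password": [1.0, 0.0, 0.1],
    "how do I change my password": [0.98, 0.05, 0.12],
    "cancel my subscription": [0.1, 1.0, 0.0],
}


def toy_embed(query: str) -> Vector:
    return TOY_VECTORS[query]


sem_cache = SemanticCache(toy_embed, threshold=0.95)
sem_cache.put("reset password", "Go to Settings > Security.")
paraphrase_hit = sem_cache.get("how do I change my password")
unrelated = sem_cache.get("cancel my subscription")
```

With these toy vectors the paraphrase lands above the threshold and the unrelated question falls far below it. Real embeddings are rarely this well separated, which is why the threshold needs tuning on real traffic.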
Cache 3: Provider-side prompt caching
This one works differently. It lives inside the model provider (Anthropic, OpenAI, or an open-source inference server like vLLM) and caches not the final answer but the computed KV state of a long, stable prefix.
What is KV cache?
When a transformer processes a sequence of tokens, it computes key-value attention tensors for each token. For a 10,000-token system prompt, that computation is expensive — and it is identical for every request that uses the same system prompt. Provider-side prompt caching stores those KV tensors so that subsequent calls with the same prefix skip recomputing them.
On a cache hit you get both lower latency and lower input-token cost. Anthropic bills cache hits on the stable prefix at 10% of the normal input-token price, and the cached portion adds almost no processing latency.
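The effect of the 10% rate on input cost can be worked through with a small calculation. This is a sketch only: the price of 3.00 per million input tokens is a made-up figure, the function name is hypothetical, and the calculation ignores any surcharge a provider may apply when the prefix is first written to the cache.

```python
def input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    price_per_million: float,
    prefix_cached: bool,
    cached_price_ratio: float = 0.10,  # cache hits billed at 10% of the normal rate
) -> float:
    """Input-token cost of one request, with or without a cached prefix."""
    prefix_rate = cached_price_ratio if prefix_cached else 1.0
    billable_tokens = prefix_tokens * prefix_rate + suffix_tokens
    return billable_tokens * price_per_million / 1_000_000


# Hypothetical request: 9,500-token stable prefix, 500-token user message.
uncached = input_cost(9_500, 500, price_per_million=3.00, prefix_cached=False)
cached = input_cost(9_500, 500, price_per_million=3.00, prefix_cached=True)
savings_fraction = 1 - cached / uncached
```

Under these assumptions the cached request costs a little under 15% of the uncached one, because the 500-token suffix is still billed at the full rate.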
How to exploit it
Structure your prompt so the stable part comes first and the variable part comes last:
[system prompt] <-- stable, maybe 1,000 tokens
[shared document/tools] <-- stable, maybe 8,000 tokens
[few-shot examples] <-- stable, maybe 500 tokens
[user message] <-- variable, changes every call
The provider caches KV for everything up to the last stable token. The user message is always recomputed — but it is small.
The rule: keep the prefix byte-stable. Any change to a single character in the prefix invalidates the cache for everything after that point. Do not inject timestamps, request IDs, or user-specific data into the prefix — put those in the suffix.
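The byte-stability rule can be checked mechanically by fingerprinting the prefix. The sketch below is an illustration with hypothetical helper names; it does not call any provider API, it only shows where per-request data has to go for the prefix to stay identical.

```python
import hashlib
from typing import List, Tuple


def build_prompt(stable_parts: List[str], user_message: str, request_meta: str = "") -> Tuple[str, str]:
    """Return (prefix, suffix). Per-request data goes in the suffix only."""
    prefix = "\n\n".join(stable_parts)
    suffix = f"{request_meta}\n{user_message}" if request_meta else user_message
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    # Identical fingerprints mean identical bytes, so the provider can reuse its cache.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


stable = ["You are a support assistant.", "<shared document>", "<few-shot examples>"]

prefix_1, _ = build_prompt(stable, "How do I reset my password?", "ts=2024-01-01T10:00:00Z")
prefix_2, _ = build_prompt(stable, "Where is my invoice?", "ts=2024-01-01T10:05:00Z")

# Anti-pattern: a timestamp inside the prefix changes it on every request.
bad_prefix, _ = build_prompt(["ts=2024-01-01T10:05:00Z"] + stable, "Where is my invoice?")

stable_match = prefix_fingerprint(prefix_1) == prefix_fingerprint(prefix_2)
broken_match = prefix_fingerprint(prefix_1) == prefix_fingerprint(bad_prefix)
```

Logging the fingerprint alongside each request is a cheap way to notice when something has started leaking variable data into the prefix.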
Unlike exact-match or semantic caching, prompt caching does not return a cached answer. It returns a cached intermediate computation. The model still generates a fresh response from the cached prefix plus the new suffix, so the answer is always computed from the actual new input and the false hits that semantic caching risks cannot occur.
Cache invalidation (the hard part)
“There are only two hard things in computer science: cache invalidation and naming things.” — Phil Karlton
The three failure modes to prevent:
1. Stale model responses
If you upgrade the embedding model from text-embedding-ada-002 to text-embedding-3-large, the old cached embeddings are wrong — they live in a different vector space. Always include the model name and version in the cache key. A simple version tag like v=2 in the key buys you a clean namespace when you upgrade.
key = f"embed:{model_name}:{model_version}:{hashlib.sha256(text.encode()).hexdigest()}"
2. Time-sensitive data
Cache entries should carry a TTL (time-to-live) proportional to how fast the underlying data changes. A “What is your refund policy?” answer can live for a week; a “What is the current price of X?” answer should live for seconds or not at all. Design your TTL by asking: if I served this answer 24 hours from now, would it be wrong?
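A per-entry TTL can be sketched as follows. This is a minimal in-memory illustration with a hypothetical `TTLCache` class; the clock is injectable so expiry can be demonstrated without sleeping. Stores such as Redis provide key expiry natively, so in practice you would usually rely on that instead.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache where every entry carries its own expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value


now = [0.0]  # fake clock, in seconds
ttl_cache = TTLCache(clock=lambda: now[0])
ttl_cache.set("refund-policy", "30 days, no questions asked", ttl_seconds=7 * 24 * 3600)
ttl_cache.set("price-of-x", "$10", ttl_seconds=5)

now[0] = 10.0  # ten seconds later
refund_answer = ttl_cache.get("refund-policy")
price_answer = ttl_cache.get("price-of-x")
```

After ten seconds the price entry has expired while the refund policy entry is still served, matching the week-versus-seconds distinction above.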
3. Cross-user cache leakage
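If a response depends on who is asking (account data, conversation history, permissions), a key built from the prompt alone can serve one user's answer to another. Namespace the key by tenant and user for anything personalized. A robust key design:
# For a user-scoped LLM response
key = f"llm:{tenant_id}:{user_id}:{model}:{version}:{hashlib.sha256(prompt.encode()).hexdigest()}"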
For embeddings that are not user-specific (e.g. embedding a shared knowledge base document), you can omit the user ID — the embedding does not depend on who is asking.
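The two key shapes can be wrapped in small helpers so that scoping is decided in one place. The function names below are hypothetical, and the sketch assumes identifiers do not themselves contain the `:` separator.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def llm_cache_key(prompt: str, model: str, version: str, tenant_id: str, user_id: str) -> str:
    """Key for a personalized LLM response: scoped to one tenant and one user."""
    return f"llm:{tenant_id}:{user_id}:{model}:{version}:{sha256_hex(prompt)}"


def embed_cache_key(text: str, model: str, version: str) -> str:
    """Key for a shared embedding: depends only on the text and the model."""
    return f"embed:{model}:{version}:{sha256_hex(text)}"


prompt = "Summarize my last three orders."
alice_key = llm_cache_key(prompt, "some-model", "v2", tenant_id="acme", user_id="alice")
bob_key = llm_cache_key(prompt, "some-model", "v2", tenant_id="acme", user_id="bob")
doc_key = embed_cache_key("Refund policy document text", "some-embedding-model", "v2")
```

The same prompt from two users yields two different keys, so one user's cached answer can never be returned to the other.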
The three caches at a glance
| | Exact-match | Semantic | Prompt (provider) |
|---|---|---|---|
| What’s cached | Full response | Full response | KV tensors of prefix |
| Key type | Hash of input bytes | Nearest vector in index | Exact byte prefix |
| Hit condition | Identical input | Cosine above threshold | Identical prefix |
| Latency on hit | ~1 ms | ~5–20 ms (vector search) | Less time spent processing the prefix |
| Cost on hit | ~$0 | ~$0 + 1 embed call | ~10% of normal price for prefix tokens |
| Best for | Embeddings, temp=0 LLM | Paraphrase-heavy Q&A | Long shared system prompts |
| Main risk | Low hit rate for varied input | False hit if threshold too low | Prefix must stay byte-stable |
Putting it together
For the “same queries over and over” scenario, the playbook is:
- Add exact-match caching first. It is O(1), cannot return an answer stored for a different input, and in a chatbot with heavily repeated queries the hit rate can reach 60–80% with no tuning.
- Layer semantic caching on top for the paraphrase tail. Start with a high threshold (0.95+) and lower it cautiously while monitoring for false hits.
- Enable provider prompt caching for your long system prompt and tool definitions. This is nearly free to adopt — just keep the prefix stable.
- Version your cache keys on model name and version. Set TTLs. Namespace by tenant and user for any personalized response.
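The layering in this playbook can be sketched as a single lookup function that tries the cheapest, safest cache first. Everything here is illustrative: `layered_answer` is a hypothetical name, and the toy semantic store matches on a lowercased, punctuation-stripped string as a crude stand-in for an embedding search with a threshold. TTLs and user scoping are left out to keep the example short.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple


def layered_answer(
    query: str,
    model: str,
    exact_cache: Dict[str, str],
    semantic_get: Callable[[str], Optional[str]],
    semantic_put: Callable[[str, str], None],
    call_llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (answer, source), where source is 'exact', 'semantic', or 'llm'."""
    key = f"llm:{model}:{hashlib.sha256(query.encode('utf-8')).hexdigest()}"
    if key in exact_cache:             # 1. exact match: no false-hit risk
        return exact_cache[key], "exact"
    cached = semantic_get(query)       # 2. semantic match: catches paraphrases
    if cached is not None:
        return cached, "semantic"
    result = call_llm(query)           # 3. miss everywhere: pay for the real call
    exact_cache[key] = result
    semantic_put(query, result)
    return result, "llm"


# Toy stand-ins so the flow can run without any external service.
_semantic_store: Dict[str, str] = {}


def _canonical(query: str) -> str:
    return query.lower().strip(" ?!.")


def toy_semantic_get(query: str) -> Optional[str]:
    return _semantic_store.get(_canonical(query))


def toy_semantic_put(query: str, answer: str) -> None:
    _semantic_store[_canonical(query)] = answer


def toy_llm(query: str) -> str:
    return f"answer to: {query}"


exact: Dict[str, str] = {}
r1 = layered_answer("Reset password", "some-model", exact, toy_semantic_get, toy_semantic_put, toy_llm)
r2 = layered_answer("Reset password", "some-model", exact, toy_semantic_get, toy_semantic_put, toy_llm)
r3 = layered_answer("reset password?", "some-model", exact, toy_semantic_get, toy_semantic_put, toy_llm)
```

Returning the source alongside the answer makes it easy to log hit rates per layer and to audit semantic hits for false matches while you tune the threshold.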