datarekha

Caching: exact, semantic & prompt

Your embedding API keeps answering the same question. Three kinds of cache that cut LLM cost and latency — and the one that can silently return wrong answers.

9 min read Advanced Generative AI Lesson 22 of 24

What you'll learn

  • Exact-match caching and why embeddings are the perfect candidate
  • Semantic caching: matching by meaning with a vector search + threshold
  • Provider-side prompt caching: reusing the KV of a stable prefix
  • Cache invalidation: TTLs, versioning by model+params, and never leaking across users
  • When semantic caching returns a confidently wrong answer

Before you start

The basic idea

A cache sits between your application and the expensive operation (an LLM call, an embedding API call). On a hit, you return a stored result instantly. On a miss, you do the real work, store the result, and return it. The art is in choosing what to use as the key.

IncomingRequestCachelookup keyRedis / in-memoryHITReturn stored result$0 extra cost~1 ms latencyMISSCall LLM / API$$ cost, ~1–2 s latencyStore result in cache
The basic cache loop: a hit returns the stored result instantly; a miss calls the real API, pays the cost, then stores the result for next time.

The critical variable is hit rate. A cache with a 10% hit rate barely moves the needle. A cache with a 90% hit rate cuts your bill by 90%. Everything that follows is in service of achieving the highest hit rate for a given level of correctness.


Cache 1: Exact-match

Key: hash(model + params + exact input text)

Value: the stored response

Lookup: O(1) — a single hash lookup in Redis or an in-memory dictionary.

If the input bytes are identical, the key is identical, and you get a hit. That’s it. No vectors, no machine learning.

Why embeddings are the perfect candidate

An embedding model is deterministic: the same text plus the same model always produces the same vector. There is no temperature, no randomness. Once you’ve computed embed("What is your refund policy?") with text-embedding-3-small, that exact 1536-dimensional vector will never change — unless the model version changes.

This means you can cache embeddings indefinitely (until the model is updated). In a RAG system that embeds thousands of documents, or in a chatbot that embeds the same query variants repeatedly, this collapses re-computation to a hashtable lookup.

import hashlib, json

def make_embed_key(text: str, model: str) -> str:
    payload = json.dumps({"text": text, "model": model}, sort_keys=True)
    return "embed:" + hashlib.sha256(payload.encode()).hexdigest()

# On a cache miss you call the API; on a hit you skip it entirely.

Exact-match also works well for LLM completions at temperature=0 — same prompt, same output every time. For anything with temperature above 0 the output is stochastic, so caching may return a stale response instead of a freshly sampled one; decide based on whether freshness matters for your use case.

The limit: it is byte-sensitive. “Reset my password” and “reset my password ” (trailing space) are different keys. Paraphrases never hit.


Cache 2: Semantic cache

Key concept: match by meaning, not by exact bytes.

Instead of hashing the input, you embed it and store it in a vector index alongside its cached answer. On a new request, you embed the incoming query, search the index for the nearest stored query, and if the cosine similarity is above a threshold, return the stored answer.

Exact-match cache”reset my password”Incoming queryhash(model+params+text)deterministic keyRedis O(1) lookuphit only if bytes match”how do I change my password”✗ MISS — different bytesSemantic cache”how do I change my password”Incoming query (paraphrase)embed(query) → vectorembedding API callvector search → nearestcosine similarity scoredcosine ≥ threshold?e.g. 0.95✓ HIT — return stored answer
Left: exact-match hashes the bytes — paraphrases always miss. Right: semantic cache embeds the query and does a vector search — the paraphrase hits because its cosine similarity to the stored query is above the threshold.

This catches paraphrases that exact-match misses. “Reset password” and “how do I change my password” embed close together; a threshold of cosine > 0.95 is tight enough that they hit each other while a different question stays below the threshold and misses.

# Pseudocode — semantic cache lookup
def semantic_cache_get(query: str, model: str, threshold: float = 0.95):
    q_vec = embed(query, model)                   # 1 embedding call
    nearest, score = vector_index.search(q_vec, k=1)
    if score >= threshold:
        return cache_store.get(nearest.key)       # hit
    return None                                   # miss — caller must call LLM

Cache 3: Provider-side prompt caching

This one works differently. It lives inside the model provider (Anthropic, OpenAI, or an open-source inference server like vLLM) and caches not the final answer but the computed KV state of a long, stable prefix.

What is KV cache?

When a transformer processes a sequence of tokens, it computes key-value attention tensors for each token. For a 10,000-token system prompt, that computation is expensive — and it is identical for every request that uses the same system prompt. Provider-side prompt caching stores those KV tensors so that subsequent calls with the same prefix skip recomputing them.

You benefit from lower latency and lower input token cost. Anthropic charges 10% of the normal input token price for cache hits on the stable prefix, and near-zero additional latency for that portion.

How to exploit it

Structure your prompt so the stable part comes first and the variable part comes last:

[system prompt]          <-- stable, maybe 1,000 tokens
[shared document/tools]  <-- stable, maybe 8,000 tokens
[few-shot examples]      <-- stable, maybe 500 tokens
[user message]           <-- variable, changes every call

The provider caches KV for everything up to the last stable token. The user message is always recomputed — but it is small.

The rule: keep the prefix byte-stable. Any change to a single character in the prefix invalidates the cache for everything after that point. Do not inject timestamps, request IDs, or user-specific data into the prefix — put those in the suffix.

Prompt structure for prefix cachingSTABLE PREFIXSystem prompt + tool definitions + shared context documentMay be thousands of tokens. Identical across all requests in a session.KV cached≈10% token costappendVARIABLE SUFFIXUser message this turnSmall — always recomputed, always freshrecomputedfull token costKeep the prefix byte-stable — any change invalidates the KV cache from that point onward
Prompt caching: a long, byte-stable prefix gets its KV tensors cached by the provider. Only the small variable suffix is recomputed on each call. Put the stable material first, user-specific content last.

Unlike exact-match or semantic caching, prompt caching does not return a cached answer. It returns a cached intermediate computation. The model still generates a fresh response from the cached prefix + new suffix — so the answer is always correct for the new input.


Cache invalidation (the hard part)

“There are only two hard things in computer science: cache invalidation and naming things.” — Phil Karlton

The three failure modes to prevent:

1. Stale model responses

If you upgrade the embedding model from text-embedding-ada-002 to text-embedding-3-large, the old cached embeddings are wrong — they live in a different vector space. Always include the model name and version in the cache key. A simple version tag like v=2 in the key buys you a clean namespace when you upgrade.

key = f"embed:{model_name}:{model_version}:{sha256(text)}"

2. Time-sensitive data

Cache entries should carry a TTL (time-to-live) proportional to how fast the underlying data changes. A “What is your refund policy?” answer can live for a week; a “What is the current price of X?” answer should live for seconds or not at all. Design your TTL by asking: if I served this answer 24 hours from now, would it be wrong?

3. Cross-user cache leakage

A robust key design:

# For a user-scoped LLM response
key = f"llm:{tenant_id}:{user_id}:{model}:{version}:{sha256(prompt)}"

For embeddings that are not user-specific (e.g. embedding a shared knowledge base document), you can omit the user ID — the embedding does not depend on who is asking.


The three caches at a glance

Exact-matchSemanticPrompt (provider)
What’s cachedFull responseFull responseKV tensors of prefix
Key typeHash of input bytesNearest vector in indexExact byte prefix
Hit conditionIdentical inputCosine above thresholdIdentical prefix
Latency on hit~1 ms~5–20 ms (vector search)Lower generation latency
Cost on hit~$0~$0 + 1 embed call~10% of prefix tokens
Best forEmbeddings, temp=0 LLMParaphrase-heavy Q&ALong shared system prompts
Main riskLow hit rate for varied inputFalse hit if threshold too lowPrefix must stay byte-stable

Putting it together

For the “same queries over and over” scenario, the playbook is:

  1. Add exact-match caching first. It is O(1), has zero false-hit risk, and in a chatbot with repeated queries the hit rate is often 60–80% with no tuning.
  2. Layer semantic caching on top for the paraphrase tail. Start with a high threshold (0.95+) and lower it cautiously while monitoring for false hits.
  3. Enable provider prompt caching for your long system prompt and tool definitions. This is nearly free to adopt — just keep the prefix stable.
  4. Version your cache keys on model name. Set TTLs. Namespace by user for any personalized response.

Quick check

0/3
Q1An embedding model always returns the same vector for the same input text and model. What does this imply for caching?
Q2You lower the semantic cache similarity threshold from 0.95 to 0.80 to raise your hit rate. What is the danger?
Q3Your multi-tenant SaaS product serves 200 enterprise customers. Each customer's data is isolated. You add an LLM-powered Q&A feature and cache LLM responses by hashing the prompt. What is the critical mistake?

Practice this in an interview

All questions
What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.

What is the KV cache in a transformer and why does it matter for inference?

The KV cache stores the key and value tensors computed during previous forward passes so they do not need to be recomputed for every new token during autoregressive generation. Without it, generating each token would require a full forward pass over the entire context from scratch, making inference cost grow quadratically with sequence length rather than linearly.

What are embeddings and why are they central to modern deep learning?

An embedding is a dense, learned vector representation of a discrete or high-dimensional object — a word, image, user, product — in a continuous low-dimensional space. Proximity in embedding space reflects semantic or behavioural similarity, making embeddings a universal interface between raw data and neural networks.

How does caching and persist work in Spark, and when should you use each storage level?

cache() stores a DataFrame in executor memory using the default MEMORY_AND_DISK storage level. persist() lets you choose the storage level — memory-only, disk-only, serialized, or replicated. Use caching when a DataFrame is reused multiple times in the same application; without it, Spark recomputes the entire lineage from scratch on each action.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content