Cache the question, not just the bytes
Your embedding API answers the same query a thousand times a day. Three caches cut LLM cost and latency — and one of them can hand back a confidently wrong answer.
The dashboard didn’t look alarming at first. Just a tidy table of API calls, timestamped, grouped by endpoint. But when the team on the observability side sorted by frequency instead of time, something ugly came into focus: the top fifty queries to their embedding service were identical strings, repeating in thousand-round cycles across the day. “What are your business hours?” “How do I reset my password?” “Where is my order?” The same ten questions, over and over, each one triggering a paid round-trip to the embedding API, each one returning the exact same 1,536-dimensional float array it returned the time before — and the time before that. Nothing was being learned. Nothing was changing. Money was simply leaving the account.
That’s where most teams discover they have a caching problem. Not in architecture review, not in a post-mortem. In a billing alert.
Cache One: Exact Match
The most underappreciated fact about embedding models is that they are deterministic for a fixed input and model version. Feed the same text into text-embedding-3-small today and tomorrow, and you get back bit-identical output. There is no temperature, no sampling, no stochastic component. The model is a pure function.
That makes exact-match caching almost embarrassingly easy. Hash the concatenation of the input text and the model identifier — something like SHA256(model_id + "|" + text) — store the resulting vector in Redis with that hash as the key, and serve it on the next identical request without ever touching the API. Hit rate alone determines ROI. For the support-bot case above, a hit rate above 80% is common within the first week of production traffic; repeated navigational queries dominate real usage.
The same logic applies to LLM completions when the prompt is fully deterministic — temperature zero, no sampling, no injected randomness. If you’re running classification pipelines, intent detection, or routing layers where the prompt template is fixed and the user input is bounded, you can cache the entire response. The key is that your cache entry is only as fresh as your prompt template: the moment you touch the template, the old cache entries are stale. We’ll come back to that.
Cache Two: Semantic Match
Real users don’t type the same string twice. They paraphrase, abbreviate, add typos, use synonyms. “Reset my password” and “how do I change my password?” and “forgot password help” are semantically equivalent — any of them should route to the same cached answer. Exact hashing misses all of them.
Semantic caching handles this by embedding the incoming query, then doing a nearest-neighbor lookup against the set of already-cached queries. If the top hit has a cosine similarity above some threshold, you return the stored answer for that cached query instead of generating a new one. The incoming query gets added to the cache regardless, so the index grows organically.
The diagram below sketches both flows side by side:
Semantic caching is seductive because it surfaces a genuinely higher hit rate. The trap is the threshold.
The correction isn’t subtle: keep the threshold conservative. In practice, cosine > 0.95 is a reasonable starting point for short factual queries. Domain scoping matters too — a cache built on product documentation queries shouldn’t serve hits to billing queries even if the sentences happen to be similar. Namespace your cache by domain, and scope semantic similarity checks within that namespace. If your product serves multiple customers, scope caches by tenant, not by query alone.
Cache Three: Provider-Side Prompt Caching
The first two caches live in your infrastructure. This one lives in the provider’s inference cluster, and the mechanics are different enough to deserve separate treatment.
When you send a prompt to a model like Claude or GPT-4, the first thing the model does is convert the input tokens into key-value pairs that the attention layers use during generation. That prefill step has a cost — compute, time, money. If you’re sending the same long system prompt on every request, you’re paying for that prefill every single time.
Provider-side prompt caching solves this by keeping the computed KV state for a stable prefix warm in memory. On Anthropic’s API, you mark the prefix with a cache_control breakpoint; on OpenAI, it happens automatically for long enough prefixes. The condition is byte stability: the cached prefix must be identical, character for character, on every request. Vary even a single token in the prefix and the cache busts.
The design implication is to front-load everything that doesn’t change — system instructions, tool schemas, reference documents, few-shot examples — and append only the user turn and session context at the tail. A system prompt that references today’s date in its opening line defeats the entire mechanism. Move dynamic content to the end.
The payoff is substantial. Anthropic charges a 25% write premium and gives 90% off on cache reads. For a 6,000-token system prompt sent a million times a day, that arithmetic is transformative. Unlike semantic caching, there is no answer-correctness risk here — the model still runs; only the KV computation is reused.
The Unglamorous Part: Invalidation
Every cache in this stack needs an invalidation strategy, and they all look slightly different.
For exact-match caches, invalidate by keying on model version and API parameters, not just the input text. An upgrade from text-embedding-3-small to text-embedding-3-large produces different vectors; if you key only on text, a stale vector from the old model gets served by the new one. Include the full model identifier and any relevant parameters in the hash.
For semantic caches, add a TTL on entries whose underlying facts can change — pricing, availability, policy. An answer that was accurate in January may be wrong in March. TTL doesn’t solve the semantic-correctness problem, but it bounds the staleness window.
For prompt caches, cache invalidation is straightforward: any change to the prefix busts it automatically because the prefix is no longer byte-identical. What you need to manage instead is the cost of busting — if your deployment pipeline regenerates the system prompt on every release, every release triggers a cold cache window while the provider re-warms. Pin the prompt hash in your deployment artifact and track it explicitly.
The Stack in Practice
Wiring all three together into a single request path gives you a layered defense against wasted compute. The incoming query hits exact-match first — cheapest lookup, zero false-hit risk. On a miss, it goes to the semantic layer, which embeds the query and checks similarity against the domain-scoped index. On a semantic miss, the request goes to the LLM, which benefits from the provider-side KV cache on its long stable prefix. You pay for generation only on genuinely novel queries.
In a mature production system, this means generating fresh answers for a small fraction of traffic. The exact-match layer handles repeated navigational queries — typically 40-60% of support bot traffic. The semantic layer catches the paraphrase tail — another 20-30%. What reaches the model is the genuinely novel, contextual, or complex subset — the queries that actually needed a generation in the first place.
Caching is the highest-ROI lever in an LLM stack, and it’s underused not because teams don’t know about it but because semantic caching — the most powerful tier — has a non-obvious failure mode. More cache hits can quietly mean more wrong answers. The discipline is treating the threshold as a product decision, not an infrastructure knob. How much false confidence can your users absorb before trust erodes? The answer to that question is your threshold floor.
The full treatment — including threshold calibration methodology, cache warming strategies, and per-provider prompt-caching pricing comparisons — is in the Generative AI → Systems Design at Scale → Caching section of the course.