Prompt caching: the 90% cost cut explained
Anthropic, OpenAI, DeepSeek and Google all ship prompt caching now, but the pricing models diverge sharply. For a tool-using agent with a long system prompt, getting the cache pattern right is the difference between a viable product and a P&L disaster.
If you run a tool-using agent in production in 2026, your dominant cost is almost certainly input tokens you keep re-sending. The system prompt, the tool schemas, the few-shot examples, the retrieval scaffolding, the user history — all of it gets re-encoded on every turn. For a typical Claude-Sonnet-powered agent with eight tools and a 6K-token system prompt, the prompt cost outweighs the completion cost by 4-6x even before the conversation gets long.
Prompt caching is the lab response to that asymmetry. The idea is dead simple: the KV cache for the prefix is already in HBM after the first request; on subsequent identical prefixes, skip the prefill compute and just rehydrate the cached state. The pricing then reflects what was saved.
What’s not dead simple is that the four major providers have settled on four meaningfully different pricing models, and the engineering pattern that maximises savings on Anthropic actively works against you on Google. This post walks through how each works, what they cost, and what an agent architect should actually do.
The four pricing models
Start with the Anthropic model, which is the most explicit. You mark a prefix in your prompt with a cache_control breakpoint. On the first call, Anthropic charges 1.25x the base input price to write the cache (a 25% premium). On every subsequent call within the 5-minute TTL, that prefix costs only 0.1x the base input price — a 90% discount. There’s also a 1-hour TTL option that costs 2x to write but persists longer for sporadic traffic. For Sonnet 4.6, that translates to $3.75/MTok to write the cache, $0.30/MTok to read it, against the $3/MTok base.
OpenAI took the opposite approach. Their caching is fully automatic, kicks in at 1024 tokens, and writes are free — but the discount is smaller, around 50% off cached input tokens rather than 90%. There’s nothing to configure; the system caches the longest seen prefix and increments in 128-token blocks. Cache eviction is opaque, typically 5-10 minutes during peak hours, sometimes up to an hour off-peak.
DeepSeek is the most aggressive. Their Context Caching on Disk runs automatically with no breakpoints, no premium on writes, and reads are charged at 2% of base input on V4-Flash — a 98% discount. On V4-Pro the read price is $0.0145 per million tokens against $1.74 for cache misses. The trade is that you’re locked into their inference stack and their TTL behavior, but for high-volume agent workloads this is the cheapest cache on the market by a wide margin.
Google’s Gemini is the outlier on storage. They charge a per-hour rent on cached tokens — $4.50 per million tokens per hour on Pro, $1.00 on Flash — in addition to the 90%-off read discount. This makes sense if you understand the design goal: Gemini’s caching is built for enormous stable contexts (1M-token codebases, full document corpora), where the storage cost is a rounding error against the read savings. For a 3K-token agent system prompt it’s not the right tool.
Why caching exists at all — the technical primitive
To understand why the pricing landed where it did, it helps to be precise about what’s being cached. In a transformer, generating each token requires attention over every previous token’s key/value vectors. Those vectors are computed during the prefill phase — the compute-heavy pass over the input prompt — and stored in the GPU’s HBM as the KV cache. For a 6K-token prefix, the prefill is a non-trivial fraction of a second on an H100 and consumes serious FLOPS.
The insight behind prompt caching is that two requests with identical prefixes will produce bit-identical KV cache entries for those prefix tokens. If the system has retained the cache from request 1, request 2 can skip prefill for the shared prefix and start generation immediately. The compute saved at the lab is real; the question is how that saving gets priced for customers.
Anthropic and Google chose to make the customer declare the cache (you mark the breakpoint, you get the discount). OpenAI and DeepSeek chose to make the cache implicit (the system detects shared prefixes automatically). Both are legitimate engineering choices. Explicit caching gives the operator more control and clearer cost predictability; automatic caching is friendlier to users who haven’t thought about cache hierarchies yet.
The other dimension is storage. Cache entries occupy HBM, and HBM is expensive. Anthropic’s 5-minute TTL is essentially saying “we’ll keep your prefix in fast cache for 5 minutes after last access.” The 1-hour option pays for longer residency. Google’s per-hour storage charge is the most explicit pricing of this trade-off — you literally rent the HBM minutes.
A note on the OpenAI batching API interaction
A nuance worth being aware of: OpenAI’s Batch API (50% discount on input/output for jobs that complete within 24 hours) and prompt caching can interact in counterintuitive ways. Batch jobs are queued and executed asynchronously, often in a different serving pool than interactive requests. The result is that prompt cache hit rates on Batch jobs can be lower than on interactive ones — the cache may not be warm when your batched job actually runs.
For workloads that combine the two (overnight batch processing of a high-volume dataset with shared system prompts), the actual cost savings can be lower than the simple “50% batch + 50% cache” math would suggest. The compound discount isn’t quite multiplicative in practice.
The pragmatic answer is to test the actual behavior with your real workload before assuming the savings. For high-volume batch users, talking to your account team about whether you can pin cache behavior across batch jobs is sometimes worth a conversation.
The pricing isn’t only about cost
A frequent oversight: prompt caching pricing is also implicitly capacity allocation. Anthropic’s 5-minute TTL keeps cached entries out of long-term HBM residency, which lets them serve more distinct users. The 1-hour option at 2x price reflects the actual capacity cost of pinning HBM for an hour. DeepSeek’s aggressive caching on V4-Flash is partly enabled by their lower model serving costs — they can afford to keep more in cache.
For high-volume customers, the pricing decisions reflect what’s economically possible from the provider’s side as much as what’s marketable. Anthropic moved to a 1-hour option once their serving infrastructure could support it; before that, they couldn’t have offered it at any price. The cache durations and TTLs you see are constrained by the realities of HBM allocation across millions of concurrent users.
How a tool-using agent breaks down
The reason caching matters so much for agents specifically is that the per-turn input profile of an agent is dominated by stuff that doesn’t change. Let’s break down a realistic Claude Sonnet 4.6 agent:
System prompt : 1,200 tokens (stable forever)
Tool definitions (8) : 4,800 tokens (stable until tool change)
Few-shot examples (5) : 2,400 tokens (stable per agent version)
Conversation history : 3,000 tokens (grows monotonically)
New user turn : 150 tokens (the only "fresh" bit)
--------------
Total input : 11,550 tokens
Without caching, every turn costs 11,550 × $3 / 1M = $0.0347 in input alone, plus the output cost. At 1 million turns a month, that’s $34,700 in input costs.
With Anthropic-style caching applied to everything except the new user turn (the first 11,400 tokens become the cached prefix), the steady-state cost per turn drops to roughly:
Cached read (11,400 t) : 11,400 × $0.30 / 1M = $0.00342
Fresh input (150 t) : 150 × $3.00 / 1M = $0.00045
----------
Per-turn input cost : $0.00387
That’s an 89% reduction in input cost — almost exactly the 90% the marketing promises. At a million turns a month, you go from $34,700 to $3,870. The write premium gets amortized in literally the first cache hit.
This is why Anthropic’s prompt caching page leads with “break-even at one read.” It is genuinely that good — if you structure your prompt right.
A common mistake — caching writes you didn’t mean to write
A subtle pitfall worth calling out: with explicit-breakpoint providers like Anthropic, accidentally placing the breakpoint inside a section that varies per-request is more expensive than not caching at all. You pay the 1.25x write premium on every request and you get no read hits because no two requests share the prefix exactly.
A real example. A team set their Anthropic cache breakpoint after a section that included the current ISO timestamp (“Current time: 2026-05-12T14:32:17Z”). Every request had a fresh timestamp; every request triggered a fresh cache write. The cache write count was 100% of requests. The team’s cost went up 25%, not down, because they were paying the write premium without ever benefiting from a read.
The fix was to move the timestamp below the breakpoint, in the user turn section. Cache hit rate jumped from 0% to 94% overnight. The lesson: think hard about where the breakpoint goes, and instrument the write-to-read ratio. A healthy ratio is well below 0.1 (i.e., many reads per write); a ratio close to 1.0 means you’re caching things you shouldn’t be.
When caching doesn’t help
It’s worth being explicit about the workloads where prompt caching is a wash or worse:
- Single-call workloads with no reuse. If a customer makes one call per system prompt and then never repeats, you pay the write premium with no reads to amortize. This is rare for agents but common for one-off batch jobs.
- Highly variable prompts. If every request has a substantively different prefix (different user context dumped in, different tools loaded), the cache hit rate stays low. The fix is to refactor toward shared prefixes, but sometimes the use case genuinely requires variability.
- Very short prompts. Below the cache minimum (1024 tokens for OpenAI, smaller for others), caching is either not engaged or not worth the overhead. The savings on a 500-token prompt are pennies; the engineering cost of cache discipline outweighs the win.
- Streaming-only workloads where every byte matters. Some agents stream multiple turns within a single API call (streaming tool use). The caching interactions here are subtle and provider-specific.
For these workloads, the right move is to not bother with explicit cache management — either accept the cost, or use a provider with automatic caching (OpenAI, DeepSeek) where it’ll engage when it can and not penalize you when it can’t.
The prefix-stability discipline
The catch is buried in three words: identical prefix bytes. If a single token in the cached prefix changes, the cache miss is total. There’s no partial match, no fuzzy lookup — you pay full input price for everything from the divergence forward, plus the cache-write premium if you want to repopulate.
For an agent this implies a strict ordering discipline. The rule is most-stable-first:
Three failure modes to avoid:
- Timestamp leakage in the system prompt. Don’t include “Today is 2026-05-12” in cached prose — that string mutates every day. Inject the date via a tool call instead, or place it in a non-cached section.
- Non-deterministic tool ordering. If you load tools from a Python dict and serialize in iteration order, you’ll get a different prefix between processes. Sort by tool name before emitting JSON.
- User identity in the system prompt. “You are talking to user_id=8347” makes the cache useless across users. Push user-specific info below the breakpoint.
A real cautionary tale: in early 2026, Cursor’s Composer 2.5 team reported that a single misplaced lastModified field in their tool schema was costing them roughly $200K a month in lost cache hits across the fleet. The fix was a one-line sort.
The latency story
Cost is the headline, but latency is the other half of why prompt caching matters. Prefill is the slow part of inference for long prompts — for an 8K-token prefix on a frontier model, prefill is often 200-500ms before any token comes back. A cache hit reduces prefill latency by an order of magnitude, often to under 50ms.
For interactive products (chatbots, coding assistants), this is the difference between feeling responsive and feeling sluggish. Time-to-first-token (TTFT) is the metric most user-facing AI products optimize for, and a cold-cache TTFT is often 5-10x worse than a warm-cache TTFT. Users notice the difference even when they can’t articulate it.
The latency savings compound with the cost savings, but they’re independently meaningful: even if cost weren’t a factor, you’d want caching for the UX. This is why some teams configure aggressive cache writes even for prefixes they expect low reuse on — the latency win on the rare hit can be worth the write overhead.
Where the providers diverge in practice
Once you’re past the basics, the operational differences matter more than the headline numbers.
Anthropic. The 5-minute TTL is short enough that bursty single-user traffic — a coding session, a research thread — keeps the cache warm, but cross-user reuse rarely works unless you have very high QPS on the same prefix. The 1-hour TTL fixes the latter at 2x write cost. Anthropic supports up to 4 cache breakpoints, which lets you cache the system+tools as one block and the user history as a separate block that re-extends each turn. That second breakpoint is the trick that makes long conversations economical.
OpenAI. Free writes are a real advantage when prompts are highly variable but exceed 1024 tokens. You don’t have to think about it; the savings just show up. The 50% read discount is worse than Anthropic’s 90%, but you avoid the write premium entirely, so for low-reuse workloads (think: 2-3 calls per cache instance) you can come out ahead. The downside is the lack of explicit control — you can’t tell OpenAI “this 5K prefix is stable, prefer to keep it.” It guesses.
DeepSeek. When your traffic pattern fits, this is uncontested on price. A million-call-per-day agent with a stable 8K system prefix on V4-Flash is paying pennies per day on input tokens. The catch is the rest of the stack — model quality is competitive but not frontier, and your geopolitical risk profile may not allow it.
Google. Gemini’s per-hour storage rent makes the math invert: it’s a bad choice for short stable prefixes and a brilliant choice for million-token contexts that get queried hundreds of times an hour. The classic Gemini caching workload is “load the entire codebase, answer 200 questions about it in 30 minutes, discard.” For agent system prompts of under 10K tokens, most analyses put Gemini behind both Anthropic and OpenAI for cost-effectiveness.
A worked example — the long-running coding session
Consider Claude Code running for an hour against a large codebase. The system prompt and core tool definitions sum to about 8K tokens; the conversation history can balloon to 60K+ by the end of the session as files get read and tool outputs accumulate.
Without caching, every turn re-encodes 60K-80K tokens at the full $3/MTok input price. A 20-turn session ends up costing $4.50-$6 in input tokens alone — for a single user session.
With Anthropic’s two-cache-breakpoint pattern (one for system+tools, one for conversation history):
Turn 1: write cache @ 1.25x for 8K system prefix = $0.030
full input for 200t user turn = $0.0006
--------
$0.031
Turn 5: cache read for 8K system prefix = $0.0024
+ write cache @ 1.25x for 30K conv history = $0.1125
+ 200t user turn = $0.0006
--------
$0.116
Turn 15: cache read for 8K (system) + 60K (history) = $0.0204
+ 200t user turn = $0.0006
--------
$0.021
By turn 15, with both prefixes warm, the per-turn input cost has dropped from $0.20+ to about $0.02 — an order of magnitude. Over a long session, the cache write costs amortize into noise. This is the architecture Anthropic recommends in their documentation, and it’s why Claude Code’s per-session economics work despite the heavy token throughput.
A useful corollary: the second cache breakpoint (on the growing conversation history) is what makes long sessions economical. Without it, every turn re-encodes the entire history. With it, the history gets incrementally cached as it grows.
Edge cases in the worked example
The above math assumes the cache stays warm. In practice, the 5-minute TTL is the most common reason a long-running session loses cache hits. If a user pauses Claude Code for 6 minutes to read documentation, the cache evicts, and the next turn pays full input price again — plus the write premium to re-warm.
The Anthropic 1-hour TTL is the answer for human-paced sessions, at 2x write cost. The economics shift: you pay more on the write but you survive natural pauses. For a typical coding session with 5-minute lulls, the 1-hour TTL has a clear edge once the user makes more than two requests inside the window.
A side observation: the lab’s internal cache management almost certainly retains entries longer than the advertised TTL when HBM isn’t under pressure. Real-world cache hit rates often look better than the worst-case TTL would suggest. But you shouldn’t depend on this — design for the advertised behavior, and treat any over-performance as a bonus.
The cache-hit rate is now a KPI
The most concrete lesson from teams running large agent fleets in 2026: the cache hit rate is now a top-level operations metric, alongside p50 latency and token cost. The teams I’ve talked to at Replit, Cursor, and a handful of enterprise platforms target a 90%+ cache hit rate on the system+tools prefix and treat any regression below 85% as a Sev-2 incident.
How to instrument it:
- Anthropic returns
cache_creation_input_tokensandcache_read_input_tokensin the usage block of every response. The hit rate iscache_read / (cache_read + cache_creation + non-cached input). - OpenAI returns
prompt_tokens_details.cached_tokenson every response. Watch this in your aggregations. - DeepSeek returns
prompt_cache_hit_tokensandprompt_cache_miss_tokensper call. - Google returns
cachedContentTokenCounton Gemini responses when explicit caching is engaged.
The ops pattern: log the hit rate per agent version, alert when a deploy drops it below a threshold, treat a degradation as a real bug. In effect, the cache hit rate becomes the proxy metric for “did someone accidentally put a timestamp in the system prompt this week.”
The multi-tenant cache problem
A subtle issue most teams hit a few months into a heavy caching deployment: cache hit rates that look great in single-user testing can be terrible in multi-tenant production. Two users with slightly different system prompts (a custom persona, a user-specific tool) generate completely different prefixes, and the cache doesn’t see them as the same.
The math gets harsh quickly. Imagine a SaaS product with 10,000 daily active users, each with a slightly customized system prompt. If the per-user prompt is unique, you’re effectively running 10,000 separate caches, each warmed by a single user’s QPS. For most users, the 5-minute Anthropic TTL will expire between requests, leading to cache miss after cache miss.
The architectural responses, in increasing order of engineering investment:
- Templated prompts. Push user-specific values into a structured section below the cache breakpoint. The shared prefix stays identical across users; only the unshared tail differs. This is the cheap fix and usually recovers 90% of the cache benefit.
- Hierarchical breakpoints. Use multiple cache breakpoints: one for the truly universal prefix (cached across all users), one for the per-org prefix, one for the per-user prefix. Each level gets its own TTL behavior.
- Cache-aware routing. For very high QPS deployments, route requests with the same prefix to the same backend replica to maximize prefix-cache hit on the lab side. This is something Anthropic and OpenAI’s serving layers do automatically; you can sometimes nudge it with consistent hashing on your end.
- Long-TTL caches for stable global prefixes. Use Anthropic’s 1-hour TTL on the universal prefix (paid at 2x write but rarely re-written) and the cheaper 5-minute TTL on the per-user tail.
A useful internal metric is the prefix uniqueness ratio — the fraction of distinct prefixes vs. total prompt sends. Lower is better. A ratio of 0.1 means every prefix is shared by ~10 requests on average; a ratio of 1.0 means every prefix is unique. The teams running healthy multi-tenant caching keep this ratio below 0.2 in the steady state.
What changes when you switch providers
A subtle operational consideration: the same agent code running on Anthropic vs. OpenAI vs. DeepSeek will hit different cache behaviors even with no source-code changes. The same prompt structure that gives a 90% cache hit rate on Anthropic might land at 65% on OpenAI because of the smaller discount, and at 98% on DeepSeek because of automatic prefix detection.
If you build a multi-provider abstraction layer (LiteLLM, LangChain’s ChatModel interface, your own), you should be aware that the unit economics shift dramatically across providers. The exact same agent can cost 2-3x more on one provider than another simply because the cache mechanics differ. Decisions made for provider portability sometimes cost more in cache savings than the portability was worth.
The pragmatic approach used by teams shipping at scale: pick a primary provider, optimize the prompt structure for that provider’s cache model, and accept that fallback to a secondary provider will be more expensive. Don’t try to optimize for all four simultaneously — the constraints are too different.
What this means for build-vs-buy
Three production takeaways for anyone shipping an agent in 2026:
- Long, stable prefixes are now table stakes. If your agent has a 6K-token system prompt and tool schema and you’re not caching, you’re paying ~5-10x what you should. Both API and cost dashboard will tell you so within a week of going live.
- Pick your provider partly on your traffic pattern. Bursty single-session traffic loves Anthropic’s explicit 5-minute TTL. Highly variable but always-large prompts love OpenAI’s automatic free writes. Massive contexts hammered for short windows want Gemini. Heavy steady-state agentic traffic on a budget should at least price out DeepSeek.
- Treat prefix stability as code. Lint your tool schemas. Sort your tool list deterministically. Inject volatile data via tool calls, not system prompt strings. The cost of getting this wrong is a recurring monthly invoice surprise.
The most striking thing about prompt caching in 2026 is how universal it has become — three years ago this was a research curiosity; today every serious agent product is built around it. The 90% headline number is real, but only if you treat your prefix as a versioned artifact, not as casual prose. Get that discipline right and the next time someone asks why your gross margins are so much better than competitors’, you’ll be able to point at a single graph.
The other lesson worth carrying away: caching is now infrastructure, not a feature. It has the same character as connection pooling, response compression, or HTTP/2 — once it exists, building without it becomes increasingly expensive. The teams that treated caching as a 2024 optimization were still optimizing it in 2026; the teams that treated it as a 2026 architectural requirement built their products around it from the start. Both can ship, but the cost structures look very different at scale, and that difference compounds over months and years.
For an agent product that’s billed by usage, the prompt cache hit rate is in many ways the most important number on the dashboard. It governs gross margin, it governs unit economics, and it governs whether the product is sustainable at the prices being charged. Every team I’ve talked to that scaled an agent product past meaningful revenue had a story about realizing this and reorganizing their engineering priorities accordingly. Most of them had at some point burned several hundred thousand dollars on cache misses before the realization. Don’t be that team.
Further reading: Anthropic’s prompt caching docs, OpenAI’s caching announcement, DeepSeek’s KV cache documentation, Gemini’s context caching reference, and Artificial Analysis’s cross-provider cost comparison. For the underlying serving-stack mechanics, see our companion piece on how Anthropic serves a hundred million tokens a second.