How Anthropic serves a hundred million tokens a second
The frontier labs and their serving partners — Anthropic, OpenAI, Together AI, Fireworks — hide an arsenal of inference optimisations behind the simple-looking chat endpoint. Here's the hierarchy, what each layer bought, and the public numbers behind it.
In March 2025 Anthropic mentioned, almost in passing during an earnings-adjacent investor briefing, that Claude’s serving fleet was handling on the order of 100 million tokens per second across all customers. OpenAI’s last public disclosure, a year earlier, was in the same ballpark. By any reasonable measure this is a staggering number — more output per second than every printing press in 1900 combined — and it is sustained continuously, not as a peak burst.
How? Not by buying more GPUs, or at least not only that. The frontier labs and the serving infrastructure companies that support them (Together AI, Fireworks, Anyscale, Baseten) have spent two years stacking inference optimisations that compound. Each layer in the stack does roughly an order of magnitude of work. Take any one away and throughput craters.
This post is a tour of that stack — what each layer is, what it bought, and where the public numbers come from.
The naive baseline: one prompt, one GPU, one token at a time
Start with the unoptimised case. A single H100 running Llama-3-70B in fp16, no tricks, processing one request at a time. Generation speed is in the ballpark of 30-50 tokens per second per request. The GPU is memory-bound: each generated token requires reading the entire model weights from HBM (~140 GB) plus the KV cache for the context. The math op count is small relative to the bytes moved.
If this were the actual production cost, no one could afford to run Claude or ChatGPT. The first big optimisation isn’t an algorithm, it’s the realisation that the GPU is sitting idle most of the time.
Layer 1 — The KV cache, and why paging it changed everything
Every transformer keeps a key-value cache: the attention vectors for every token seen so far, kept in HBM so the next token’s attention only needs to compute one new row. For a 32K context Llama-3-70B request, that cache is ~10 GB. For long contexts it dwarfs the model weights.
The pre-vLLM serving systems allocated each request a contiguous slab of that cache, sized to the maximum possible length. The problem was internal fragmentation — a request that ended at 1K tokens would still have reserved 32K of cache, wasting 31K worth of HBM that some other request could have used. Empirically this wasted 60-80% of cache memory.
vLLM’s 2023 PagedAttention paper made a direct analogy to OS-level virtual memory: divide the cache into fixed-size blocks (typically 16 tokens each), keep a per-request page table, and allocate blocks on demand. Suddenly the cache could be packed densely. vLLM reported 2-4x higher throughput than the then-best serving systems (HuggingFace TGI, Faster Transformer) on the same hardware.
Paged KV cache is now table-stakes. Every serious serving stack — vLLM, SGLang, TGI, NVIDIA TensorRT-LLM, Anthropic’s and OpenAI’s internal stacks — implements some variant.
Layer 2 — Continuous batching, where the 10x lives
Static batching collects N requests, pads them to the same length, runs them together, and returns the batch’s outputs. The problem: every request in the batch must wait for the slowest one to finish before any new request can join. The GPU goes idle as soon as the first request emits its end-of-sequence token, even though longer requests are still running.
Continuous batching (introduced as iteration-level batching in the Orca paper, OSDI ‘22) reschedules the batch at every decoding step. A request that finishes mid-batch is immediately replaced by a queued request. The GPU never sees a half-empty batch.
Anyscale’s widely-cited benchmark on Llama-13B showed continuous batching delivering 23x higher throughput than static batching for the same hardware and the same SLO. That single number is why every modern serving stack does it by default. We cover the mechanics of continuous batching in more detail in its own post.
Layer 3 — Speculative decoding, when latency matters more than throughput
Continuous batching maximises tokens-per-second across the fleet. But for a single conversation, latency is governed by how long it takes the big model to emit each token sequentially.
Speculative decoding breaks the sequential bottleneck. A small “draft” model proposes the next K tokens cheaply; the big model verifies all K in a single forward pass; the longest accepted prefix is emitted. The key insight is that the verification pass is essentially free compared to the cost of running the big model serially — you’re already running it, you might as well check K tokens at once.
Reported speedups: 1.5-3x on Llama-class models, with the exact number depending on how well-aligned the draft and target distributions are. The Medusa variant skips the draft model entirely and trains additional decoding heads onto the target model itself. EAGLE-2 and EAGLE-3 push it further with a dynamic acceptance tree.
Anthropic’s Claude Sonnet 4 Fast variant (May 2025) is widely understood to be using speculative decoding aggressively — output rates of 400+ tok/s on that model line are not explainable any other way. We dig into the mechanics in Speculative decoding in the wild.
Layer 4 — Quantisation, especially FP8
Until 2024 most serving was fp16 or bf16. The H100 introduced first-class FP8 support, halving memory bandwidth requirements for the dominant matrix-multiply ops. FP8-quantised Llama-3-70B fits on a single H100 with room for a healthy KV cache; bf16 requires two and a NVLink crossing.
Together AI’s public benchmarks on Llama-3.1-405B reported ~2x throughput in FP8 vs BF16 with a “~1-2% degradation on standard benchmarks.” For most production traffic, that trade is a no-brainer.
INT4 quantisation (GPTQ, AWQ) goes further but hurts quality more visibly on reasoning tasks. The frontier labs largely run FP8 in production for the flagship models and INT4/INT8 for the cheaper ones.
Layer 5 — Disaggregating prefill and decode
The newest layer in the stack, and the one that solves the most awkward problem: prefill and decode have fundamentally different compute profiles. Prefill (processing the input prompt) is compute-bound — it does parallel matrix multiplies across thousands of input tokens at once. Decode (generating output) is memory-bound — one token at a time, hopping in HBM for the weights.
Running both on the same GPU means the GPU oscillates between two regimes, neither of which is its happy path. Worse, a long prefill request causes head-of-line blocking that spikes decode latency for everyone else.
The fix — first formalised in the DistServe and Splitwise papers in late 2023, and now implemented in vLLM, SGLang, and the proprietary stacks — is to assign dedicated prefill nodes and dedicated decode nodes, then ship the KV cache between them over RDMA or NVLink. The reported wins:
- DistServe reports ~7x improvement in goodput (requests that meet SLO) at the 99th percentile under bursty traffic.
- Anthropic’s August 2025 Claude infrastructure post explicitly mentions “separate pools for prompt processing and token generation” as a major source of TTFT (time-to-first-token) improvements.
Disaggregation is the layer that converts good serving systems into great ones, but it requires inter-node KV transfer at speed, which is why it co-evolves with the hardware (NVL-72, fast Ethernet fabric).
Cache-aware routing — the meta-layer
Wrapped around all five layers is cache-aware routing. Production endpoints don’t serve every request to a random GPU. They route on prefix: two requests sharing a system prompt should hit the same replica so the prefix’s KV cache hits. Anthropic’s prompt caching documentation exposes this to customers as an explicit cache marker — set the cache breakpoint, pay 10% of the input price for cached tokens, get 90% of the latency cut.
Internally this is sticky routing plus a global prefix-tree (the RadixAttention trick from SGLang) that the router consults when picking a replica. The wins compound with every other layer because cache hits skip prefill entirely.
The whole stack, in numbers
A rough order-of-magnitude tally for a frontier 100B-parameter model in mid-2026, comparing naive vs fully-optimised:
| Layer | Wins (multiplier) |
|---|---|
| Paged KV cache | ~2-4x throughput |
| Continuous batching | ~10-23x |
| Speculative decoding | ~1.5-3x latency |
| FP8 quantisation | ~2x throughput |
| Disaggregated prefill / decode | ~3-7x goodput |
| Cache-aware routing | ~5-10x on cached |
These do not literally multiply — they overlap, and each is measured against a different baseline. But the cumulative effect is roughly the two orders of magnitude that separate a research prototype from a service that bills usefully at $3 per million tokens.
What this means if you don’t run a frontier lab
If you’re shipping LLM-backed features on top of an API, you don’t need to build any of this — but you do need to understand it enough to choose your vendor and your usage pattern wisely:
- Long stable system prompts go in the cache. Anthropic and OpenAI both expose explicit cache breakpoints. Use them; cached input tokens are 10x cheaper.
- Batch where you can. If you have 100 independent classifications to do, send them as a batch to a serving stack that supports it; you will pay decode amortised over the batch.
- Streaming matters more than total latency for chat UX. TTFT is what users feel; total tokens is what they pay for. Different optimisations target different metrics.
- Self-hosting Llama-3-70B is cheaper than you think — once you’ve got vLLM and continuous batching working. Without those, it’s catastrophically more expensive than the API.
Takeaway
The 100M-tokens-a-second figure is not a brute-force feat. It’s a careful layering of optimisations, each invented in the open, each contributing a multiplier, each enabled by the one beneath it. The reason vLLM, SGLang and TGI exist as open-source projects is that the primitives are public; the edge that the frontier labs hold is in operationalising the whole stack at fleet scale, with cache-aware routing and disaggregated pools that an individual user can’t easily reproduce.
If you only remember one thing: continuous batching plus paged KV cache is the difference between an LLM serving system that scales and one that melts. Everything else is gravy on top — important gravy, but the bottom two layers are where 90% of the win lives.
Further reading: vLLM’s PagedAttention paper, the Orca / continuous batching paper, Anyscale’s continuous batching benchmark, DistServe on disaggregation, and the SGLang RadixAttention paper for cache-aware routing.