datarekha
Infrastructure May 27, 2026

Surviving 10,000 concurrent requests to your LLM API

An LLM request spends 99% of its life waiting. Design around that one fact and 10k concurrent users stops being scary.

8 min read · by datarekha · llmscalingasyncload-balancinginfrastructure

It was 11:47 PM on a Tuesday when the on-call page fired. A startup had just gone viral — a TechCrunch article dropped at midnight and by 11:49 their API gateway was returning 504s to everyone. The founder was watching their Twitter mentions explode in real time while their backend silently burned down.

The post-mortem was not complicated. They’d built a wrapper around a hosted LLM API — smart, clean Python, nothing obviously wrong. Requests came in, a thread pool dispatched them to the upstream model, responses came back. It worked fine in staging. It worked fine at 100 concurrent users, at 500. Then the article hit and traffic spiked 50x in four minutes. The thread pool pegged. The process’s memory climbed past the container limit. Kubernetes killed the pod. The replacement pod started, got hammered instantly, died again. Rinse, repeat. The system had no theoretical ceiling on concurrency because nobody had thought about what the threads were doing.

What they were doing — almost all of the time — was nothing. They were waiting.

The one fact that changes everything

An LLM request is not like a database query. A database round-trip might take 5 milliseconds. An LLM streaming response takes 2 to 30 seconds and trickles tokens back the whole way. Your application code sends a prompt, then sits completely idle while the model generates. The CPU is not spinning. No computation is happening on your side. The thread is parked, burning megabytes of OS stack space, waiting for the next token to arrive over a TCP connection.

In traditional web serving this is acceptable because your average request completes in tens of milliseconds and thread overhead amortizes away. With LLM calls the math inverts. If your median response latency is 8 seconds, and you’re on a machine with 8 GB of RAM, and each thread consumes its default 8 MB stack, you can hold roughly 1,000 threads before you run out of memory — with nothing but idle waiting on each one. That’s your concurrency ceiling, imposed purely by the cost of doing nothing.

Scale that to 10,000 concurrent users and you need a fundamentally different model. Not more RAM, not bigger VMs. A different abstraction for waiting.

SYNC: thread per requestthreadthreadthreadthreadOOMeach thread: ~8 MB stack, 100% idlewaiting… waiting… waiting… waiting… waiting…waiting… waiting… waiting… waiting… waiting…waiting… waiting… waiting…ceiling ~1k threads — then OOMASYNC: event loopevent loop1 OS threadawaitawaitawaitawaitawaitawait10k coroutines, one threadeach “await” yields back to loopceiling = downstream throughput
Thread-per-request runs out of memory holding idle waits. An async event loop holds the same 10,000 waits as cheap coroutines on a single OS thread.

Move 1: stop threading, start awaiting

The insight behind async I/O is almost offensively simple: if you’re waiting for network I/O, you don’t need a thread. You need a bookmark. A coroutine suspended at await response.aiter_lines() costs a few hundred bytes of heap, not 8 MB of stack. The event loop — asyncio in Python, the Tokio runtime in Rust, libuv in Node — manages thousands of these bookmarks, checks the network selectors, and resumes whichever coroutines have new data. One OS thread does all of it.

The practical consequence is that the per-request overhead drops by roughly four orders of magnitude. Where you could hold perhaps 1,000 blocked threads in a few gigabytes of RAM, the same machine running an async framework like FastAPI over uvicorn, or Starlette with an async httpx client, can hold tens of thousands of in-flight LLM calls limited only by open file descriptors and the connection pool to the upstream model. In Python specifically, raising ulimit -n to 65,536 and tuning asyncio’s event loop policy (use uvloop in production — it is a drop-in that typically doubles throughput on I/O-heavy workloads) is the first config change that costs essentially nothing and buys a great deal.

There are two traps to avoid once you go async. First, never call a synchronous, blocking HTTP client inside an async def function. A requests.get() inside an async handler blocks the entire event loop for the duration of the call — all other concurrent waits stall behind it. Use httpx.AsyncClient, aiohttp, or the async path of whatever SDK you’re calling. Second, never do CPU-intensive work — parsing large documents, running embeddings, anything that burns a full core for more than a millisecond — directly in the async path. That work belongs in a thread pool (asyncio.run_in_executor) or a separate worker process. The event loop is for waiting, not for computing.

When these rules hold, the ceiling shifts. It is no longer determined by how many threads you can stuff into RAM. It becomes: how many tokens per second can the downstream model actually produce, and how many connections can the model provider accept from your IP range? Those are real limits, but they are infrastructure limits you can negotiate and scale, not architectural limits you stumbled into by picking the wrong abstraction.

Move 2: replicas, and the round-robin trap

One async process on a beefy VM will get you far, but single-process concurrency has its own ceiling: the connection pool to your upstream model provider, ulimit on file descriptors, and — critically — the fact that you are still a single point of failure. At any serious scale you run N replicas behind a load balancer.

Here is where teams consistently make a mistake that looks innocuous until traffic gets heavy: they use round-robin load balancing, the default on almost every L4 load balancer and on many L7 ones. Round-robin assumes requests are equal. LLM requests are not equal. A “what’s 2+2?” request takes 200 milliseconds. A “rewrite this 6,000-word document” request takes 45 seconds. Under round-robin, Replica 3 might end up holding all the long-running summarizations while Replica 1 has finished its quick Q&A calls and is sitting idle. The overloaded replica slows down, starts queuing new requests behind its existing concurrency, and your tail latency climbs even though your fleet is collectively underutilized.

The fix is least-outstanding-requests (LOR) routing, sometimes also called least-connections at the request level. The balancer tracks how many active requests each replica is currently handling and sends new arrivals to whichever replica has the most headroom. This self-corrects naturally: the replica burning time on long generations accumulates outstanding count, so the balancer stops sending it new work until it catches up. LOR is available in nginx (as least_conn), in Envoy (as LEAST_REQUEST), and in every major cloud load balancer under slightly different names. Switching from round-robin to LOR is a one-line config change that, in practice, flattens the tail latency distribution and improves effective throughput by 15–40% on workloads with variable request lengths.

A second load-balancing optimization that is specific to LLM serving and underappreciated in application-layer proxy literature: prefix-aware routing. Most hosted inference providers and self-hosted stacks (vLLM, SGLang) maintain a KV cache — a cache of attention key/value tensors for prefixes they’ve already processed. If you have a system prompt that’s identical across all requests — a 2,000-token context window preamble, for instance — and you can route all requests that share that prefix to the same replica, that replica can serve the cached prefix essentially for free instead of recomputing it on every request. At production token volumes this is a nontrivial fraction of inference cost. The technique requires that your balancer be prefix-aware (vLLM’s --enable-prefix-caching flag combined with a sticky-by-prefix routing rule), but it is worth the config complexity for high-volume deployments with shared system prompts.

Move 3: when capacity runs out, be explicit about it

Async plus smart load balancing handles most traffic spikes. But no fleet is infinitely elastic, and there will be moments when you hit the ceiling. What happens then is the third trap.

The naive async pattern has a subtlety: an await that is waiting for a slot in a connection pool is still alive. It is holding memory, it is in the event loop’s scheduler, and if you have 50,000 requests pile up waiting for a 200-connection pool to free up, you have 49,800 coroutines quietly consuming heap until the process keels over. This is technically an async OOM and it is harder to debug than the thread-based version because memory growth is slower and doesn’t correlate obviously with CPU.

The correct pattern is bounded concurrency with explicit backpressure. In Python that means an asyncio.Semaphore with a hard cap — say, 500 — around your LLM calls. Any request that cannot acquire the semaphore gets a 429 immediately, not a slow wait. Return the 429 with a Retry-After header, ideally with some jitter so clients don’t synchronize their retries into a thundering herd. The semaphore acts as a circuit breaker that keeps your process alive when demand exceeds capacity.

For workloads that are genuinely async from the user’s perspective — batch document processing, overnight summarization jobs, async enrichment pipelines — the right answer is a job queue rather than long-lived HTTP connections. Celery, Temporal, or a simple Redis-backed queue decouples request acceptance from request processing. Your HTTP layer accepts work, enqueues it, returns a job ID immediately, and a separate pool of workers drains the queue at a rate that matches your LLM capacity. The user polls or gets a webhook when the job completes. This pattern handles the traffic spike at launch, the overnight batch, and the weekend cron job with the same infra because the queue absorbs the burst and the workers process at steady state.

The shape of the problem, clearly

Zoom out and these three moves share a common thread. Every design decision comes back to the same observation: an LLM request spends the vast majority of its wall-clock life in the model, not in your code. Your job is to hold in-flight work as cheaply as possible (async coroutines), distribute that work as evenly as possible across your serving capacity (least-outstanding-requests), and shed load explicitly when you hit the ceiling (bounded semaphores and queues) rather than letting your process discover the limit through collapse.

10,000 concurrent requests is not a hardware problem. A modern async server process running on a 4-core VM handles that concurrency level comfortably if the work is I/O-bound. The mistake is treating LLM serving like compute-bound request handling — thread-per-request, round-robin, unlimited concurrency — because the mental model was built for a different class of workload.

The startup from the TechCrunch night rebuilt their stack over the following week. Synchronous thread pool replaced with FastAPI plus uvicorn and an async httpx client. Round-robin replaced with least-outstanding-requests on their load balancer. A semaphore set to 400 to return clean 429s when upstream was saturated. Their next traffic spike — a Product Hunt launch a month later — produced higher peak concurrency than the TechCrunch night. They didn’t page anyone.

The full architecture for designing LLM systems at scale — including inference routing, KV cache management, and rate-limiting strategies — is covered in depth in our Generative AI → Systems Design at Scale lessons.

Skip to content