Async vs sync: handling concurrency
Why an LLM API serves 10,000 waiting requests on one thread — and when async helps, when it doesn't.
What you'll learn
- Why LLM calls are I/O-bound (you wait on the model, you don't compute)
- How a blocking thread-per-request server dies under load
- How one async event loop juggles thousands of in-flight requests
- Concurrency vs parallelism, and async (I/O) vs processes (CPU)
- The pattern: async FastAPI + pooled async HTTP client + backpressure
The waiter analogy
Imagine a restaurant where the rule is: one waiter serves one table for the entire meal — taking the order, then standing beside the table staring at the kitchen until the food arrives, then delivering it. That waiter is blocked waiting. For most of the meal they’re doing nothing useful. A 50-table restaurant needs 50 waiters.
Now imagine a competent waiter: takes Table 1’s order, submits it to the kitchen, walks over to Table 2, takes their order, checks in with Table 3, comes back when Table 1’s food is ready. One waiter handles 20 tables because the kitchen prep time (the slow part) frees them up to do other things.
An LLM server works the same way. The “kitchen” is the model — generating tokens takes hundreds of milliseconds to many seconds. Your server is the waiter. Sync = one waiter per table. Async = one competent waiter for the floor.
Why LLM requests are I/O-bound
When your server handles a call to an LLM API, the timeline looks like this:
1. Send the HTTP request: microseconds.
2. Wait while the model generates tokens: typically 200 ms to several seconds.
3. Receive the response bytes: milliseconds.
Steps 1 and 3 are tiny. Step 2 is almost the entire request time, and during that wait your server is doing nothing. No CPU computation, no memory churn — just waiting on bytes from a remote host. This is the definition of I/O-bound: the bottleneck is input/output (network), not the CPU.
Contrast this with CPU-bound work: computing embeddings locally, running a Torch model on-device, tokenizing a large batch. There, every millisecond of elapsed time is a millisecond of actual computation. The CPU never idles.
The async vs sync choice matters enormously for I/O-bound work. For CPU-bound work, async does nothing helpful — more on that later.
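One way to see the difference is to compare wall-clock time with CPU time for each kind of work. The sketch below is illustrative only: `fake_llm_call` and `local_compute` are hypothetical stand-ins (a sleep and a busy loop), not real model calls.

```python
import time


def measure(fn):
    """Run fn and return (wall_seconds, cpu_seconds) for the call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def fake_llm_call():
    # Hypothetical stand-in for waiting on a remote model: the CPU is idle.
    time.sleep(0.2)


def local_compute():
    # Hypothetical stand-in for CPU-bound work such as local inference.
    sum(i * i for i in range(2_000_000))


if __name__ == "__main__":
    for name, fn in [("I/O-bound", fake_llm_call), ("CPU-bound", local_compute)]:
        wall, cpu = measure(fn)
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

For the simulated LLM call, CPU time should be a small fraction of wall time; for the busy loop the two should be close.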
Sync: the blocking thread-per-request model
A classic synchronous web server assigns one OS thread to each incoming request. When that request calls the LLM and waits for a response, the thread blocks — it sits parked, consuming memory, until the bytes arrive. Then it finishes and returns the thread to the pool.
The cost: each OS thread reserves a stack of roughly 1–8 MB (the default depends on the platform, and only the part actually touched becomes resident memory). 10,000 concurrent requests means 10,000 threads, which means roughly 10–80 GB of reserved stack space before you’ve written a single line of your application logic. On top of that, the OS must context-switch between those threads thousands of times per second, adding latency and CPU overhead for work that is, again, mostly just waiting.
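The arithmetic behind that range can be checked in a few lines. This is a back-of-the-envelope sketch; the helper name `stack_reservation_gb` is made up for illustration, and it uses decimal units (1 GB = 1000 MB).

```python
def stack_reservation_gb(threads: int, stack_mb: float) -> float:
    """Stack space reserved for `threads` OS threads, in decimal GB."""
    return threads * stack_mb / 1000


if __name__ == "__main__":
    for stack_mb in (1, 8):
        gb = stack_reservation_gb(10_000, stack_mb)
        print(f"10,000 threads x {stack_mb} MB stack = {gb:.0f} GB")
```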
In practice, worker-pool servers (like Gunicorn with sync or threaded workers) cap at a few dozen to a few hundred workers. Beyond that they either reject requests or exhaust memory. Hitting 10,000 concurrent LLM waits on a sync server means most requests queue and time out.
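The queueing effect is easy to reproduce at small scale. The following is a minimal sketch, assuming a 50 ms sleep as a stand-in for the LLM wait; `fake_llm_call` and `run_batch` are hypothetical names, not part of any server framework.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(_):
    time.sleep(0.05)  # the worker thread is parked here, doing nothing useful


def run_batch(requests: int, workers: int) -> float:
    """Serve `requests` blocking calls on `workers` threads; return wall seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_llm_call, range(requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"16 requests, 4 workers:  {run_batch(16, 4):.2f}s")
    print(f"16 requests, 16 workers: {run_batch(16, 16):.2f}s")
```

With 4 workers the 16 requests are served in four waves, so the batch takes about four times as long as a single call. Adding workers fixes that only until memory or the scheduler becomes the limit.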
Async: one event loop, thousands in flight
Python’s asyncio (and the event loop inside frameworks like FastAPI/Uvicorn) works differently. There is one thread running an event loop — a scheduler that continuously asks: “who is ready to run?” Requests are coroutines — functions that can pause themselves at a specific suspension point marked await.
When a coroutine hits await llm_client.post(...), it yields control back to the event loop. The loop picks up another coroutine that has data ready, runs it until its next await, and so on. When the network bytes for the first request finally arrive, the event loop resumes that coroutine right where it left off.
The key insight: a paused coroutine is almost free. It’s a few kilobytes of state on the heap — a saved stack frame and a reference to where execution should resume. Holding 10,000 in-flight awaits costs roughly tens of megabytes of heap, not gigabytes of OS thread stacks. No context-switch thrash, no kernel scheduler overhead.
The ceiling shifts from “how many OS threads can the kernel manage” to “how many connections does your upstream LLM provider allow, and how much memory for coroutine state” — both of which are far more generous.
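A small experiment makes this concrete. The sketch below starts 10,000 simulated requests on one thread; `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling a model, so the timing only illustrates the scheduling behaviour.

```python
import asyncio
import time


async def fake_llm_call(delay: float) -> str:
    await asyncio.sleep(delay)  # suspension point: the loop runs other coroutines
    return "ok"


async def serve(n: int, delay: float = 0.5) -> tuple[int, float]:
    """Start n simulated requests at once; return (completed, wall_seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_llm_call(delay) for _ in range(n)))
    return len(results), time.perf_counter() - start


if __name__ == "__main__":
    done, elapsed = asyncio.run(serve(10_000))
    print(f"{done} requests finished in {elapsed:.2f}s on one thread")
```

Because every coroutine waits at the same time, total wall time stays close to the delay of a single call plus scheduling overhead, rather than growing with the number of requests.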
Concurrency vs parallelism
These terms are often confused. They mean different things.
Concurrency is dealing with many things by interleaving their progress, like a single waiter managing 20 tables. Progress happens on multiple tasks, but only one at a time is actively running. asyncio gives you concurrency on a single core.
Parallelism is doing many things at the same time on multiple cores (or CPUs) simultaneously — like having 4 cooks each preparing a dish in parallel. True parallel execution needs multiple OS threads or processes.
For I/O-bound LLM work, concurrency is sufficient and cheap. The CPU isn’t the bottleneck; the network wait is. Async gives you thousands of concurrent in-flight requests on one core, which is exactly what you need.
For CPU-bound work (local embedding inference, tokenizing a 100k-token batch, running a Torch model on the host), concurrency does nothing useful — the CPU is fully occupied, so interleaving just switches between tasks that all need CPU time. You need parallelism: multiple processes via multiprocessing, a thread pool (for workloads that release the GIL), or a dedicated GPU inference service.
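The difference shows up in the order of events. In this illustrative sketch (all names are hypothetical), two I/O-style tasks interleave because they await, while two CPU-style tasks run strictly one after the other because nothing hands control back to the loop.

```python
import asyncio


async def io_task(name: str, log: list) -> None:
    log.append(f"{name} start")
    await asyncio.sleep(0.01)  # yields to the loop, so the other task can start
    log.append(f"{name} end")


async def cpu_task(name: str, log: list) -> None:
    log.append(f"{name} start")
    sum(i * i for i in range(200_000))  # no await: the loop cannot switch tasks
    log.append(f"{name} end")


async def run_pair(task) -> list:
    """Run two copies of `task` with asyncio.gather and return the event log."""
    log: list = []
    await asyncio.gather(task("a", log), task("b", log))
    return log


if __name__ == "__main__":
    print("I/O-bound:", asyncio.run(run_pair(io_task)))
    print("CPU-bound:", asyncio.run(run_pair(cpu_task)))
```

In the I/O case both tasks start before either finishes. In the CPU case task "a" runs to completion before "b" starts, so wrapping CPU work in a coroutine buys no overlap.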
The production pattern
Here is the shape of a production-grade async LLM endpoint:
```python
import asyncio

import httpx
from fastapi import FastAPI

app = FastAPI()

# ONE shared async client for the whole process lifetime.
# Connection pooling is built-in; do not construct a new client per request.
_llm_client = httpx.AsyncClient(base_url="https://api.anthropic.com", timeout=120.0)

# A semaphore caps concurrent in-flight requests to the LLM.
# This is backpressure: refuse to pile up more awaits than you can serve.
_sem = asyncio.Semaphore(200)  # tune to your provider's rate limit


@app.post("/generate")
async def generate(payload: dict):
    async with _sem:  # acquire a slot; wait (without blocking the loop) if 200 are in flight
        resp = await _llm_client.post(  # await: yields the event loop while waiting
            "/v1/messages",
            headers={"x-api-key": "..."},
            json=payload,
        )
    return resp.json()
```
Four things to notice:
- async def endpoint: FastAPI/Uvicorn runs this as a coroutine on the event loop, so it stays non-blocking only as long as every slow call inside it is awaited.
- Shared httpx.AsyncClient: creates and reuses a connection pool. Never construct httpx.AsyncClient() inside the request handler; that opens a new TCP connection on every call and can end up slower than a pooled sync client at scale.
- await on the HTTP call: yields the event loop so other requests proceed during the wait.
- asyncio.Semaphore: a backpressure valve. A semaphore is a counter that allows at most N coroutines to hold it simultaneously. Without this, under extreme load you pile up thousands of awaiting coroutines, exhaust file descriptors, and run out of memory. With it, excess requests queue cheaply in Python (not as OS threads) and are served in order.
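The semaphore's effect can be observed without a real upstream. This is a minimal sketch under stated assumptions: `Tracker`, `call_upstream`, and `run_load` are hypothetical helpers, and a 10 ms sleep stands in for the LLM round trip.

```python
import asyncio


class Tracker:
    """Records how many simulated upstream calls are in flight at once."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0


async def call_upstream(sem: asyncio.Semaphore, tracker: Tracker) -> None:
    async with sem:  # waits here, without blocking the loop, when no slot is free
        tracker.in_flight += 1
        tracker.peak = max(tracker.peak, tracker.in_flight)
        await asyncio.sleep(0.01)  # stand-in for the LLM round trip
        tracker.in_flight -= 1


async def run_load(requests: int, limit: int) -> int:
    """Fire `requests` calls through a semaphore of size `limit`; return the peak."""
    sem = asyncio.Semaphore(limit)
    tracker = Tracker()
    await asyncio.gather(*(call_upstream(sem, tracker) for _ in range(requests)))
    return tracker.peak


if __name__ == "__main__":
    print("peak in flight:", asyncio.run(run_load(100, 5)))
```

All 100 coroutines exist at once, but the number actually talking to the upstream never exceeds the semaphore size.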
Fan-out with asyncio.gather
When one user request needs multiple LLM calls (e.g., calling a summariser and a classifier concurrently), use asyncio.gather:

```python
summary, label = await asyncio.gather(
    call_llm(summarise_prompt),
    call_llm(classify_prompt),
)
```

Both calls are in flight at the same time. Total latency is max(t_summary, t_classify), not the sum: often a 40-60% wall-clock saving when the two calls take similar time, and less when one call dominates.
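The latency claim can be verified with simulated calls. In this sketch `call_llm` is a hypothetical stand-in that takes an explicit duration and sleeps; a real version would await an HTTP request instead.

```python
import asyncio
import time


async def call_llm(prompt: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for the model's generation time
    return f"result for {prompt!r}"


async def fan_out() -> tuple[list, float]:
    """Run two simulated calls concurrently; return (results, wall_seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        call_llm("summarise", 0.3),
        call_llm("classify", 0.2),
    )
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(fan_out())
    print(results)
    print(f"elapsed: {elapsed:.2f}s (sequential would be about 0.50s)")
```

asyncio.gather returns results in the order the awaitables were passed, regardless of which finishes first, so unpacking into named variables is safe.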
The blocking-call trap
The single most common async mistake:
```python
import anthropic

# WRONG: the sync client blocks the whole event loop for the duration of the call
@app.post("/bad")
async def bad(payload: dict):
    client = anthropic.Anthropic()      # sync client
    resp = client.messages.create(...)  # blocks: no await, loop frozen
    return resp


# RIGHT: use the async client, created once and shared across requests
_async_client = anthropic.AsyncAnthropic()

@app.post("/good")
async def good(payload: dict):
    resp = await _async_client.messages.create(...)
    return resp
```
If you must call a blocking library from an async handler, offload it:
```python
import asyncio
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)


async def call_blocking_lib(args):
    # blocking_fn stands for whatever synchronous function you need to call
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, blocking_fn, args)
```
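To see what "loop frozen" means in practice, the sketch below runs a ticker coroutine alongside a blocking call, once directly and once offloaded with asyncio.to_thread (Python 3.9+, a shortcut for running a function in the default thread pool). `blocking_fn`, `count_ticks`, and `run` are hypothetical names used only for this demonstration.

```python
import asyncio
import time


def blocking_fn(seconds: float) -> str:
    time.sleep(seconds)  # stand-in for a synchronous SDK call
    return "done"


async def count_ticks(stop: asyncio.Event) -> int:
    """Count how often the event loop gets to run this coroutine."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def run(offload: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the ticker take its first step
    if offload:
        await asyncio.to_thread(blocking_fn, 0.2)  # loop stays free to tick
    else:
        blocking_fn(0.2)  # loop is frozen for the whole 0.2 s
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(run(offload=False)))
    print("ticks while offloaded:", asyncio.run(run(offload=True)))
```

When the blocking call runs directly in the handler, the ticker gets a single turn; when it is offloaded, the ticker keeps running throughout, which is what every other in-flight request experiences too.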
Each request pauses at await, freeing the event loop to serve other requests. When the LLM responds, the loop resumes it. The semaphore prevents unbounded coroutine pile-up.
Backpressure and what comes next
A semaphore is the simplest form of backpressure — pushing back on upstream demand so you don’t overwhelm a downstream resource. Without it, a traffic spike means thousands of coroutines pile up in memory, all hammering the LLM API’s rate limit simultaneously, and you get a cascade of 429s and OOM kills.
The semaphore makes excess requests queue in Python, consuming almost no resources, and drains them in order as capacity frees up. This is the pattern that separates a prototype from a production service.
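A plain semaphore lets the queue of waiters grow without limit. One possible variation, shown here only as a sketch and not as something the pattern above requires, is to bound how long a request may wait for a slot and reject it otherwise. The function names, the 50 ms wait, and the "rejected" return value are all illustrative assumptions.

```python
import asyncio


async def call_with_backpressure(
    sem: asyncio.Semaphore, max_wait: float, work_seconds: float
) -> str:
    """Wait up to max_wait seconds for a slot, then give up instead of queueing."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=max_wait)
    except asyncio.TimeoutError:
        return "rejected"  # a real handler might return HTTP 503 here
    try:
        await asyncio.sleep(work_seconds)  # stand-in for the LLM round trip
        return "served"
    finally:
        sem.release()


async def burst(requests: int, limit: int) -> dict:
    """Send a burst through a semaphore of size `limit` and count the outcomes."""
    sem = asyncio.Semaphore(limit)
    results = await asyncio.gather(
        *(call_with_backpressure(sem, 0.05, 0.2) for _ in range(requests))
    )
    return {"served": results.count("served"), "rejected": results.count("rejected")}


if __name__ == "__main__":
    print(asyncio.run(burst(10, 3)))
```

With three slots and calls that take longer than the allowed wait, the first three requests are served and the rest are turned away quickly, which keeps latency bounded for the callers that do get through.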
From here, the natural next steps are:
- A request queue (Redis, SQS) to absorb bursts and retry failures; this separates receiving requests from processing them.
- Rate limiting per tenant so one customer can’t starve others.
- Load balancing across multiple async workers (Uvicorn processes) to use all CPU cores — each process has its own event loop; a reverse proxy routes across them.
Practice this in an interview
asyncio is cooperative, single-threaded concurrency: coroutines yield control only at explicit await points, so there is no GIL contention, and code between two awaits runs without interruption (state that must stay consistent across an await can still race). Threads are preemptive OS-level concurrency: the scheduler can switch at any bytecode boundary, which requires explicit locking. Choose asyncio for high-fan-out I/O (thousands of connections); choose threads when you need to call blocking APIs you cannot rewrite.
Modern APIs offer constrained decoding — the model's token sampling is restricted to only produce tokens that are valid continuations of a JSON schema. Combined with Pydantic validation in application code, this eliminates the JSON-parsing errors that plagued earlier prompt-only approaches. When constrained decoding is unavailable, few-shot examples plus output parsing with retry is the fallback.
CPU-bound work keeps the processor busy the whole time — matrix multiplication, compression, parsing. I/O-bound work spends most of its time waiting for a slow external resource — network, disk, database. The distinction directly determines which concurrency primitive to reach for: multiprocessing for CPU-bound (bypasses the GIL), threading or asyncio for I/O-bound (GIL released during waits).
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.