Load balancing LLM inference
One GPU replica can't serve 10,000 users. How to spread LLM traffic across replicas — and why round-robin is the wrong default.
What you'll learn
- Why a single inference replica saturates (bounded concurrent sequences)
- Horizontal scaling: N replicas behind a load balancer
- Why round-robin overloads under variable request cost; least-outstanding wins
- Prefix-aware routing to reuse the KV cache
- Autoscaling on queue depth / GPU utilization, plus health checks and draining
Before you start
The ceiling of one replica
A serving engine such as vLLM handles concurrent sequences through continuous batching: requests share the GPU in the same forward pass, slotted in and out as they generate tokens. But this concurrency is bounded — the GPU has finite memory (for the KV cache) and finite compute. When the number of active sequences exceeds that bound, new arrivals wait in a queue and latency climbs. The ceiling is not a configuration you can tune away; it is a hardware limit.
The implication is straightforward: one replica cannot serve unlimited traffic. You need multiple replicas.
Key terms defined here:
- Replica — a full copy of the model loaded on its own GPU (or GPU set), running its own inference server process. Each replica is independent and can serve requests in parallel.
- KV cache — the per-token attention state a transformer builds during a request; it lives in GPU VRAM and grows with sequence length. It is what limits how many concurrent sequences one replica can hold.
Horizontal scaling: N replicas behind a load balancer
The standard pattern is a load balancer (a reverse proxy or a smart router) that sits in front of N identical replicas. Clients talk to the load balancer; it forwards each request to one replica and returns the response.
This is standard horizontal scaling. The load balancer gives clients one stable endpoint; replicas can be added or removed transparently. But the routing strategy inside the load balancer matters enormously for LLMs — in ways that are easy to get wrong.
Why round-robin fails for LLMs
In a typical web backend, requests are cheap and roughly equal in cost — serving a product page or a database read takes milliseconds, and one request looks much like another. Round-robin (send the first request to replica 1, second to replica 2, third to replica 3, cycle back) works fine there because costs are homogeneous.
LLM requests are not homogeneous. A request asking for a one-sentence summary might finish in 20 tokens (~0.5 s). A request asking for a 3,000-word technical document might generate 4,000 tokens (~80 s). The ratio of compute and time is roughly 100x. Under round-robin, the scheduler doesn’t know this: it treats both requests as one unit.
The result is the same failure you see in a grocery store checkout when one cashier gets a cart with 200 items and the next gets three: one lane backs up while the other sits idle. In the language of systems: head-of-line blocking and tail latency explosion.
A better mental model is the checkout analogy: look at the length of each line and join the shortest one. That intuition maps to a real algorithm.
Least-outstanding-requests (LOR) (sometimes called least-connections): the load balancer tracks how many in-flight requests each replica currently holds. Each new request goes to whichever replica has the fewest in-flight, regardless of order. A replica finishing its heavy 4,000-token job immediately becomes the least-loaded and attracts the next arrival. This self-corrects automatically.
More sophisticated variants weight by estimated tokens remaining (requires the replica to expose queue-depth or token-count metrics) or use an EWMA (exponentially weighted moving average) of recent latency per replica. Both outperform round-robin on variable-cost LLM workloads.
Prefix-aware routing and the KV cache
There is a third routing concern specific to LLMs: KV-cache reuse.
When a request arrives, the inference server must compute attention keys and values for every token in the prompt. For a 2,000-token system prompt, that is expensive. vLLM and other modern servers implement a prefix cache: if a new request shares the same leading tokens (same prefix) as a recently-served request, the KV values for those tokens are already computed and cached in GPU memory. The new request skips that computation entirely — a dramatic latency and cost win for long shared prefixes.
The catch: the prefix cache lives on a specific replica’s GPU. If the load balancer routes requests with the same system prompt to different replicas, each replica independently computes the same KV values, and you get zero cache benefit.
Prefix-aware (session-affinity) routing solves this: hash the request’s shared prefix (system prompt text, or a conversation ID) and route all requests with the same prefix hash to the same replica. Requests with the same prefix converge to one replica, the cache is warm there, and every subsequent request in that session gets a cache hit.
The trade-off is distribution: if one prefix becomes very popular (e.g., your most common system prompt drives 80% of traffic), the replica assigned to it becomes a hot spot. Production systems balance these concerns with consistent hashing (which redistributes only a fraction of keys when replicas are added/removed) or a hybrid policy (prefer same-replica; fall back to least-outstanding when the preferred replica is overloaded).
Autoscaling: what signal to use
Horizontal scaling also means dynamic horizontal scaling: adding replicas when traffic grows and removing them when it shrinks.
The right signal for LLM autoscaling is not CPU utilization. The bottleneck is the GPU. The correct signals are:
- Queue depth — the number of requests waiting to be assigned to a replica. If the queue is consistently non-empty, you need more replicas. Queue depth is a leading signal: it rises before latency does.
- GPU utilization or tokens/sec per replica — a lagging signal, but confirms replicas are at capacity.
Model loading is slow. Loading a 70B model’s weights from storage into GPU VRAM takes 30 seconds to several minutes, depending on storage bandwidth and quantization. This means autoscaling is not a real-time tool for absorbing sudden bursts. Best practices:
- Keep at least one warm spare replica (loaded and ready but serving minimal traffic). New capacity appears in seconds rather than minutes.
- Scale on a leading signal (queue depth threshold) early enough that the new replica is ready before latency SLOs are breached.
- Use a cooldown window on scale-down to avoid thrashing.
Health checks and draining
Any load balancer setup needs two types of health checks:
- Liveness — is the process alive? If not, restart it. Checked frequently (every 5-10 s).
- Readiness — is the replica ready to accept traffic? A replica that has just started and is loading model weights should be liveness-healthy but readiness-unhealthy. The load balancer only routes to readiness-healthy replicas.
When you scale down a replica (or roll a new model version), you must drain it first: stop the load balancer from sending new requests to that replica, but allow in-flight requests to finish. An LLM generation that has already started may take 30-120 seconds; killing the replica immediately corrupts those responses. Graceful draining with a timeout (wait up to N seconds, then force) is standard practice.
Putting it together
A production LLM serving setup looks like this:
- N replicas running vLLM (or equivalent), each on its own GPU set.
- Load balancer using least-outstanding-requests as the default routing policy.
- Prefix-aware routing layered on top if you have long shared system prompts or multi-turn conversations — consistent-hash the prefix to a replica, fall back to LOR when that replica is overloaded.
- Autoscaler watching
vllm:num_requests_waiting(queue depth) and GPU utilization; keeping 1-2 warm spares; respecting a scale-down draining window. - Readiness gates so new replicas only enter the load balancer pool after the model is fully loaded and the server passes a warmup request.
Quick check
Quick check
Practice this in an interview
All questionsML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads keep CPU near-idle even under full load. Horizontal Pod Autoscaler in Kubernetes can be configured with custom metrics, and scale-to-zero with a warm-up buffer prevents cold-start latency spikes.
GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.
Latency is the time to serve a single request; throughput is the number of requests served per second. They are in tension because batching requests improves GPU utilization and throughput but adds queuing delay. The design goal is to meet the latency SLA at the highest possible throughput.