How does autoscaling work for ML inference services, and what metrics should drive it?

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads keep CPU near-idle even under full load. Horizontal Pod Autoscaler in Kubernetes can be configured with custom metrics, and scale-to-zero with a warm-up buffer prevents cold-start latency spikes.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

What is LLM model routing and how does an LLM cascade work?

Model routing sends each query to the most appropriate model based on difficulty, cost, or capability, instead of always using the largest model. A cascade is a sequential form: try the cheapest or smallest model first and only escalate to a larger model if the answer fails a quality or confidence check, reducing average cost while preserving quality on hard queries.

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.

Load balancing LLM inference — Generative AI

The ceiling of one replica

A serving engine such as vLLM handles concurrent sequences through continuous batching: requests share the GPU in the same forward pass, slotted in and out as they generate tokens. But this concurrency is bounded — the GPU has finite memory (for the KV cache) and finite compute. When the number of active sequences exceeds that bound, new arrivals wait in a queue and latency climbs. The ceiling is not a configuration you can tune away; it is a hardware limit.

The implication is straightforward: one replica cannot serve unlimited traffic. You need multiple replicas.

Key terms defined here:

Replica — a full copy of the model loaded on its own GPU (or GPU set), running its own inference server process. Each replica is independent and can serve requests in parallel.
KV cache — the per-token attention state a transformer builds during a request; it lives in GPU VRAM and grows with sequence length. It is what limits how many concurrent sequences one replica can hold.

Horizontal scaling: N replicas behind a load balancer

The standard pattern is a load balancer (a reverse proxy or a smart router) that sits in front of N identical replicas. Clients talk to the load balancer; it forwards each request to one replica and returns the response.

Clients talk to a single load balancer endpoint; the balancer fans traffic out to N independent GPU replicas, each running a full copy of the model.

This is standard horizontal scaling. The load balancer gives clients one stable endpoint; replicas can be added or removed transparently. But the routing strategy inside the load balancer matters enormously for LLMs — in ways that are easy to get wrong.

Why round-robin fails for LLMs

In a typical web backend, requests are cheap and roughly equal in cost — serving a product page or a database read takes milliseconds, and one request looks much like another. Round-robin (send the first request to replica 1, second to replica 2, third to replica 3, cycle back) works fine there because costs are homogeneous.

LLM requests are not homogeneous. A request asking for a one-sentence summary might finish in 20 tokens (~0.5 s). A request asking for a 3,000-word technical document might generate 4,000 tokens (~80 s). The ratio of compute and time is roughly 100x. Under round-robin, the scheduler doesn’t know this: it treats both requests as one unit.

The result is the same failure you see in a grocery store checkout when one cashier gets a cart with 200 items and the next gets three: one lane backs up while the other sits idle. In the language of systems: head-of-line blocking and tail latency explosion.

A better mental model is the checkout analogy: look at the length of each line and join the shortest one. That intuition maps to a real algorithm.

Least-outstanding-requests (LOR) (sometimes called least-connections): the load balancer tracks how many in-flight requests each replica currently holds. Each new request goes to whichever replica has the fewest in-flight, regardless of order. A replica finishing its heavy 4,000-token job immediately becomes the least-loaded and attracts the next arrival. This self-corrects automatically.

More sophisticated variants weight by estimated tokens remaining (requires the replica to expose queue-depth or token-count metrics) or use an EWMA (exponentially weighted moving average) of recent latency per replica. Both outperform round-robin on variable-cost LLM workloads.

Round-robin (top) ignores request cost: long requests pile onto one replica while another sits idle, spiking p99. Least-outstanding-requests (bottom) tracks in-flight count and routes to the least-busy replica, keeping load balanced regardless of request length.

Prefix-aware routing and the KV cache

There is a third routing concern specific to LLMs: KV-cache reuse.

When a request arrives, the inference server must compute attention keys and values for every token in the prompt. For a 2,000-token system prompt, that is expensive. vLLM and other modern servers implement a prefix cache: if a new request shares the same leading tokens (same prefix) as a recently-served request, the KV values for those tokens are already computed and cached in GPU memory. The new request skips that computation entirely — a dramatic latency and cost win for long shared prefixes.

The catch: the prefix cache lives on a specific replica’s GPU. If the load balancer routes requests with the same system prompt to different replicas, each replica independently computes the same KV values, and you get zero cache benefit.

Prefix-aware (session-affinity) routing solves this: hash the request’s shared prefix (system prompt text, or a conversation ID) and route all requests with the same prefix hash to the same replica. Requests with the same prefix converge to one replica, the cache is warm there, and every subsequent request in that session gets a cache hit.

The trade-off is distribution: if one prefix becomes very popular (e.g., your most common system prompt drives 80% of traffic), the replica assigned to it becomes a hot spot. Production systems balance these concerns with consistent hashing (which redistributes only a fraction of keys when replicas are added/removed) or a hybrid policy (prefer same-replica; fall back to least-outstanding when the preferred replica is overloaded).

Requests sharing the same prefix (P1) are hashed to Replica A, where the KV cache for that prefix is already warm — both get a cache hit. A request with a different prefix (P2) goes to Replica B, which computes it fresh. Without prefix-aware routing, all three replicas would recompute P1 independently.

Autoscaling: what signal to use

Horizontal scaling also means dynamic horizontal scaling: adding replicas when traffic grows and removing them when it shrinks.

The right signal for LLM autoscaling is not CPU utilization. The bottleneck is the GPU. The correct signals are:

Queue depth — the number of requests waiting to be assigned to a replica. If the queue is consistently non-empty, you need more replicas. Queue depth is a leading signal: it rises before latency does.
GPU utilization or tokens/sec per replica — a lagging signal, but confirms replicas are at capacity.

Model loading is slow. Loading a 70B model’s weights from storage into GPU VRAM takes 30 seconds to several minutes, depending on storage bandwidth and quantization. This means autoscaling is not a real-time tool for absorbing sudden bursts. Best practices:

Keep at least one warm spare replica (loaded and ready but serving minimal traffic). New capacity appears in seconds rather than minutes.
Scale on a leading signal (queue depth threshold) early enough that the new replica is ready before latency SLOs are breached.
Use a cooldown window on scale-down to avoid thrashing.

Health checks and draining

Any load balancer setup needs two types of health checks:

Liveness — is the process alive? If not, restart it. Checked frequently (every 5-10 s).
Readiness — is the replica ready to accept traffic? A replica that has just started and is loading model weights should be liveness-healthy but readiness-unhealthy. The load balancer only routes to readiness-healthy replicas.

When you scale down a replica (or roll a new model version), you must drain it first: stop the load balancer from sending new requests to that replica, but allow in-flight requests to finish. An LLM generation that has already started may take 30-120 seconds; killing the replica immediately corrupts those responses. Graceful draining with a timeout (wait up to N seconds, then force) is standard practice.

Putting it together

A production LLM serving setup looks like this:

N replicas running vLLM (or equivalent), each on its own GPU set.
Load balancer using least-outstanding-requests as the default routing policy.
Prefix-aware routing layered on top if you have long shared system prompts or multi-turn conversations — consistent-hash the prefix to a replica, fall back to LOR when that replica is overloaded.
Autoscaler watching vllm:num_requests_waiting (queue depth) and GPU utilization; keeping 1-2 warm spares; respecting a scale-down draining window.
Readiness gates so new replicas only enter the load balancer pool after the model is fully loaded and the server passes a warmup request.

Quick check

0/3

Q1You run two replicas behind a round-robin load balancer. Request lengths vary 100x (some 20-token completions, some 2,000-token ones). What happens to p99 latency?

Q2Why does prefix-aware routing improve latency for multi-turn conversations?

Q3Your autoscaler is configured to add replicas when CPU utilization exceeds 80%. During a traffic spike, users report high latency, but the autoscaler does not trigger. What is most likely wrong?

Load balancing LLM inference

What you'll learn

Before you start

The ceiling of one replica

Horizontal scaling: N replicas behind a load balancer

Why round-robin fails for LLMs

Prefix-aware routing and the KV cache

Autoscaling: what signal to use

Health checks and draining

Putting it together

Quick check

Quick check

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further