How do you balance latency and throughput trade-offs when designing a model serving system?
Latency is the time to serve a single request; throughput is the number of requests served per second. They are in tension because batching requests improves GPU utilization and throughput but adds queuing delay. The design goal is to meet the latency SLA at the highest possible throughput.
How to think about it
Latency (p50, p95, p99) is what users experience. Throughput (requests per second, tokens per second) determines cost efficiency. On GPU hardware they conflict: a batch of 32 inputs pays roughly the same fixed per-launch overhead as a batch of 1, so larger batches amortize fixed costs and raise throughput. The price is that requests arriving early must wait for the batch to fill (or for a timeout to expire) before processing starts, and the larger batch itself takes longer to compute, so latency rises.
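A back-of-the-envelope model makes the trade-off visible. This is a minimal sketch, assuming batch compute time is a fixed cost plus a per-item cost and that requests arrive at a steady rate; `batch_stats`, `fixed_ms`, and `per_item_ms` are hypothetical names with made-up values, not measurements from any real model.

```python
def batch_stats(batch_size, arrival_rate_rps, fixed_ms=5.0, per_item_ms=0.5):
    """Return (capacity_rps, worst_case_latency_ms) for one batch size.

    fixed_ms and per_item_ms are illustrative costs, not measurements.
    """
    # One forward pass: fixed overhead plus a cost that grows with the batch.
    compute_ms = fixed_ms + per_item_ms * batch_size
    # The first request in a batch waits for the remaining (b - 1) arrivals.
    fill_ms = (batch_size - 1) / arrival_rate_rps * 1000.0
    # Requests the GPU could complete per second if batches were always full.
    capacity_rps = batch_size / compute_ms * 1000.0
    return capacity_rps, fill_ms + compute_ms


for b in (1, 8, 32):
    capacity, latency = batch_stats(b, arrival_rate_rps=1000.0)
    print(f"batch={b:>2}  capacity={capacity:7.0f} req/s  worst-case latency={latency:5.1f} ms")
```

With these assumed costs, going from batch size 1 to 32 raises capacity from roughly 180 to roughly 1,500 requests per second, while worst-case latency grows from 5.5 ms to 52 ms. Real numbers have to come from profiling the actual model.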
Key levers:
Dynamic batching: collect requests until max_batch_size is reached or max_wait_ms has elapsed, whichever comes first. NVIDIA Triton and TorchServe expose this as configuration; vLLM batches at the iteration level instead (see continuous batching). Tune max_wait_ms so that p99 latency stays within the SLA while the GPU stays busy.
Model concurrency: run multiple model instances on the same GPU (MIG partitions or separate CUDA streams). This raises throughput without adding per-request latency when a single instance cannot saturate the GPU, for example with small models or small batches.
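Continuous batching (LLMs): instead of waiting for an entire batch to finish decoding, slot in new requests as soon as earlier sequences complete. vLLM does this with iteration-level scheduling; PagedAttention is its KV-cache memory manager, which lets more sequences stay in flight at once. Throughput gains of 3–10x are commonly quoted for generation workloads, but the figure depends on the baseline and on how much output lengths vary.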
Async pre/post-processing: run tokenization, feature lookup, and response serialization on CPU threads so they never block GPU inference. With each stage on its own thread, CPU work for one request overlaps with GPU work for another.
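The continuous batching lever is easiest to see by counting decode steps. The following is a toy simulation, not vLLM's scheduler: it assumes one token per active request per step, ignores prefill and memory limits, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled on the very next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        # Requests with one token left have just finished; drop them.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    output_lengths = [1, 4, 1, 4]  # made-up output lengths in tokens
    print("static:    ", static_batching_steps(output_lengths, slots=2))
    print("continuous:", continuous_batching_steps(output_lengths, slots=2))
```

For this made-up workload the static schedule needs 8 decode steps and the continuous one needs 6, because short requests no longer hold a slot idle while a long neighbour finishes. The gap widens as output lengths become more uneven.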
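# Triton dynamic batching config (fragment of config.pbtxt)
max_batch_size: 32  # must be > 0 for the dynamic batcher to apply
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 5000  # 5 ms max wait
}
instance_group [{ count: 2, kind: KIND_GPU }]  # two model instances per GPU

# Measure throughput vs latency with locust or wrk.
# The URL is a placeholder for your serving endpoint; wrk sends GET requests
# by default, so a POST body needs a Lua script passed with -s.
wrk -t4 -c64 -d60s --latency http://localhost:8080/predict
# Raise -c step by step until p99 crosses your SLA; the highest load that
# still meets the SLA is your operating point.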
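Choosing the operating point from a load-test sweep can be sketched as follows. This assumes you have recorded, for each concurrency level, the achieved throughput and the raw per-request latencies; `percentile`, `pick_operating_point`, and the sample numbers are hypothetical and only illustrate the selection rule.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pick_operating_point(sweep, sla_ms):
    """Return (concurrency, throughput_rps) with the highest throughput
    whose p99 latency is within sla_ms, or None if no run qualifies.

    sweep: list of (concurrency, throughput_rps, latencies_ms) tuples.
    """
    within_sla = [
        (throughput, concurrency)
        for concurrency, throughput, latencies in sweep
        if percentile(latencies, 99) <= sla_ms
    ]
    if not within_sla:
        return None
    throughput, concurrency = max(within_sla)
    return concurrency, throughput


if __name__ == "__main__":
    # Made-up results from three load-test runs.
    sweep = [
        (16, 400.0, [10.0] * 98 + [20.0, 25.0]),
        (32, 700.0, [15.0] * 98 + [40.0, 45.0]),
        (64, 900.0, [30.0] * 98 + [120.0, 130.0]),
    ]
    print(pick_operating_point(sweep, sla_ms=50.0))
```

The run at concurrency 64 has the highest throughput but a p99 of 120 ms, so with a 50 ms SLA the rule settles on concurrency 32.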
Capacity planning rule of thumb: target 70–80% GPU utilization. Below 60% you are probably over-provisioned; above 85% queuing latency rises sharply and non-linearly, because little headroom is left to absorb bursts.
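The non-linear growth can be illustrated with the textbook M/M/1 queue, where mean time waiting in the queue is utilization / (1 - utilization) times the mean service time. A batched GPU server is not an M/M/1 queue, so treat this as a sketch of the shape of the curve only; `mm1_queue_wait_ms` and the 10 ms service time are assumptions.

```python
def mm1_queue_wait_ms(utilization, service_ms):
    """Mean time a request waits in queue for an M/M/1 server."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * service_ms


for rho in (0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95):
    wait = mm1_queue_wait_ms(rho, service_ms=10.0)
    print(f"utilization={rho:.2f}  mean queue wait={wait:6.1f} ms")
```

Under this model the mean wait equals one service time at 50% utilization and about nine service times at 90%, which is why a small amount of headroom buys a large reduction in tail latency.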