datarekha
MLOps | Hard | Asked at: Nvidia, Google, OpenAI, Anthropic, Databricks

How do you optimise GPU utilisation for model serving, and what role does dynamic batching play?

The short answer

GPUs run tensor operations efficiently only when each call carries enough work to keep their compute units busy, and a larger batch dimension is the simplest way to supply that work. Dynamic batching collects individual requests that arrive within a short window and fuses them into a single GPU call. This raises throughput and lowers cost per request, at the price of a queueing delay that is capped by the configured wait threshold.

How to think about it

A GPU has thousands of CUDA cores that must all be fed work at once to reach peak throughput. Serving one request at a time can leave most of that capacity idle, often 90 % or more, because a batch of 1 pays the same kernel-launch overhead and reads the same model weights from memory as a batch of 64, yet produces 1/64 of the results.
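
A toy cost model makes the amortisation concrete. The sketch below assumes each GPU call costs a fixed amount plus a small amount per sample; the 2 ms and 0.05 ms figures are illustrative assumptions, not measurements, and real kernels do not scale perfectly linearly.

```python
def throughput(batch_size: int, fixed_ms: float = 2.0, per_sample_ms: float = 0.05) -> float:
    """Samples per second under a linear cost model: call time = fixed + per_sample * B."""
    call_ms = fixed_ms + per_sample_ms * batch_size
    return batch_size / call_ms * 1000.0


for b in (1, 8, 64):
    print(f"batch={b:>2}  throughput={throughput(b):8.1f} samples/s")
```

Under these assumed numbers a batch of 64 delivers about 25x the throughput of a batch of 1, while the time per call rises only from 2.05 ms to 5.2 ms.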

Dynamic batching mechanics:

The inference server maintains a queue. Arriving requests are held for at most a configured maximum delay, typically 1–10 ms (Triton exposes it as max_queue_delay_microseconds, TorchServe as max_batch_delay in milliseconds). As soon as either the batch size limit or the time limit is reached, whichever comes first, the accumulated requests are stacked along the batch dimension into a single tensor and sent to the GPU together. Each request then receives its own slice of the output.
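
The queueing logic described above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the implementation of TorchServe or Triton; the class name DynamicBatcher, its parameters, and the fake_model function are all hypothetical.

```python
import queue
import threading
import time
from concurrent.futures import Future
from typing import Any, Callable, List, Tuple


class DynamicBatcher:
    """Hypothetical sketch of a dynamic batching queue (not any server's real code)."""

    def __init__(self, batch_fn: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 32, max_delay_ms: float = 5.0) -> None:
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_ms / 1000.0
        self._queue: "queue.Queue[Tuple[Any, Future]]" = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, item: Any) -> Future:
        """Enqueue one request; the Future resolves to that request's output."""
        fut: Future = Future()
        self._queue.put((item, fut))
        return fut

    def _worker(self) -> None:
        while True:
            batch = [self._queue.get()]              # block until the first request arrives
            deadline = time.monotonic() + self.max_delay_s
            while len(batch) < self.max_batch_size:  # stop at the size limit ...
                remaining = deadline - time.monotonic()
                if remaining <= 0:                   # ... or at the time limit
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            items = [item for item, _ in batch]
            try:
                outputs = self.batch_fn(items)       # one call for the whole batch
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)           # every request in the batch fails together
                continue
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)                  # hand each request its own slice


if __name__ == "__main__":
    seen_batch_sizes = []

    def fake_model(xs):
        # Stand-in for a GPU forward pass over the whole batch
        seen_batch_sizes.append(len(xs))
        return [x * 2 for x in xs]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_delay_ms=50)
    futures = [batcher.submit(i) for i in range(8)]
    print([f.result(timeout=2) for f in futures])
    print(seen_batch_sizes)
```

The deadline starts when the first request of a batch arrives, so no request waits longer than max_delay_ms in the queue before its batch is dispatched. The exact batch sizes observed depend on thread timing.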

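# TorchServe batching is configured per model, for example when registering
# the model through the management API (my_model.mar is a placeholder name).
# batch_size=32 means max items per GPU call
# max_batch_delay=5 means wait up to 5 ms to fill the batch
# (in config.properties the same settings appear as batchSize and
# maxBatchDelay inside the per-model "models" entry)

curl -X POST "localhost:8081/models?url=my_model.mar&batch_size=32&max_batch_delay=5&initial_workers=1"
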
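# TorchServe custom handler: handle() receives the whole batch as a list
import torch

def handle(self, data, context):
    # data is a list of up to batch_size request dicts;
    # self.preprocess is assumed to turn one request into a tensor
    inputs = torch.stack([self.preprocess(d) for d in data])  # [B, ...]
    with torch.no_grad():
        outputs = self.model(inputs.to(self.device))          # [B] or [B, 1]
    # Assumes one scalar score per request; reshape fails loudly otherwise
    scores = outputs.reshape(len(data)).tolist()
    # Return exactly one entry per request, in the order received
    return [{"score": s} for s in scores]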

Additional GPU optimisation techniques:

  • FP16 / BF16 inference: halves the bytes stored and moved per weight and activation relative to FP32, and lets Tensor Cores handle the matrix maths (FP16 from Volta onwards, BF16 from Ampere onwards). This can roughly double throughput, usually with negligible accuracy loss, but validate accuracy for your own model.
  • Tensor parallelism: split large model weight matrices across multiple GPUs (Megatron-LM, DeepSpeed). Needed, alone or combined with other sharding schemes, when the model does not fit in a single GPU’s VRAM.
  • KV cache management (LLMs): vLLM’s PagedAttention stores the KV cache in fixed-size blocks that need not be contiguous, which largely eliminates memory fragmentation and allows more concurrent requests.
  • CUDA graphs: capture a fixed-shape forward pass as a CUDA graph and replay it without per-kernel Python and launch overhead. In fixed-batch-size scenarios this can reduce host-side latency by around 20–40 %, depending on how launch-bound the model is.
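
# FP16 serving in PyTorch (model is an already-loaded torch.nn.Module)
import torch

model = model.half().eval().cuda()

@torch.inference_mode()
def infer(batch: torch.Tensor) -> torch.Tensor:
    # Move to the GPU first, then cast, so the conversion runs on the device
    return model(batch.cuda().half()).float()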
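
To illustrate the KV cache bullet above, here is a toy block-table allocator in plain Python. It only mimics the bookkeeping idea behind PagedAttention (fixed-size blocks handed out on demand and returned when a request finishes); it is not vLLM's code, and the class and method names are hypothetical.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention (hypothetical names)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size                      # tokens per physical block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}      # request id -> physical block ids
        self.num_tokens: Dict[str, int] = {}              # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Reserve KV space for one more token; take a new block only when the last is full."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:         # last block full, or no block yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("req-a")    # 20 tokens -> 2 blocks
for _ in range(5):
    cache.append_token("req-b")    # 5 tokens -> 1 block
print(cache.block_tables)          # blocks need not be adjacent or ordered
cache.free("req-a")
print(len(cache.free_blocks))      # 7: everything except req-b's single block
```

Because memory is claimed one block at a time, a request wastes at most one partly filled block instead of a reservation sized for its maximum possible length.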

Sizing rule of thumb: aim for 75–85 % GPU memory utilisation with your maximum expected batch size loaded. Below 60 % there is usually room for a larger batch limit or another model instance on the same GPU; above 90 % you risk OOM errors during traffic spikes.
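
The thresholds can be turned into a quick check. The sketch below is plain arithmetic under a simplifying assumption that memory use is model weights plus a fixed amount per sample; the function name and the memory figures in the usage line are made up for illustration.

```python
from typing import Tuple


def memory_headroom(weights_gb: float, per_sample_gb: float,
                    max_batch: int, gpu_gb: float) -> Tuple[float, str]:
    """Return (utilisation fraction, verdict) using the 60 % and 90 % thresholds."""
    used = weights_gb + per_sample_gb * max_batch   # assumed linear in batch size
    util = used / gpu_gb
    if util > 0.90:
        verdict = "OOM risk"
    elif util < 0.60:
        verdict = "under-used"
    else:
        verdict = "ok"
    return util, verdict


# Hypothetical figures: 14 GB of weights, 0.15 GB per sample, 24 GB card
util, verdict = memory_headroom(weights_gb=14.0, per_sample_gb=0.15, max_batch=32, gpu_gb=24.0)
print(f"{util:.0%} -> {verdict}")   # 78% -> ok
```

In practice the inputs should come from measurement rather than estimates, for example by running the largest expected batch and reading torch.cuda.max_memory_allocated().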
