MLOps | Medium | Asked at: Google, Amazon, Uber, Databricks, Seldon

How does autoscaling work for ML inference services, and what metrics should drive it?

For: MLOps Engineer, ML Engineer, AI / LLM Engineer

The short answer

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads can leave the CPU near idle even under full load. The Kubernetes Horizontal Pod Autoscaler can be configured with custom metrics. Scale-to-zero cuts cost but introduces cold starts, so pair it with a warm replica or warm-up buffer to avoid latency spikes.

How to think about it

Standard CPU-based autoscaling is the wrong signal for GPU inference. A model server process can sit at 5% CPU while its GPU is 95% saturated. Scaling on CPU utilization means the autoscaler may not scale out until the service is already failing.

Right signals for ML autoscaling:

Metric	Tool	When to use
GPU utilization (%)	DCGM Exporter + Prometheus	General GPU serving
Request queue depth	Triton / custom metric	Bursty traffic
Requests per second	Prometheus `rate()`	Predictable load
Token generation throughput	vLLM metrics	LLM serving
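
Signals like queue time are often exported as cumulative counters rather than ready-made averages, so the value an autoscaler consumes has to be derived from two scrapes. A minimal sketch, assuming two hypothetical counters (total microseconds spent queued and total completed requests); the class and field names are illustrative and not a specific server's API:

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One scrape of two cumulative counters (hypothetical field names)."""
    queue_duration_us: float  # total microseconds requests have spent queued so far
    request_count: float      # total requests completed so far


def avg_queue_time_us(prev: CounterSample, curr: CounterSample) -> float:
    """Average queue time per request between two scrapes, in microseconds."""
    d_requests = curr.request_count - prev.request_count
    d_queue = curr.queue_duration_us - prev.queue_duration_us
    if d_requests <= 0 or d_queue < 0:
        # No traffic in the interval, or a counter reset after a restart.
        return 0.0
    return d_queue / d_requests


prev = CounterSample(queue_duration_us=1_200_000, request_count=400)
curr = CounterSample(queue_duration_us=4_200_000, request_count=460)
print(avg_queue_time_us(prev, curr))  # 3,000,000 us over 60 requests -> 50000.0
```

In a Prometheus setup the same division is usually done in a query (one `rate()` divided by another) rather than in application code.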

Horizontal Pod Autoscaler (HPA) with custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
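          # Must be served by the custom metrics API (e.g. via Prometheus Adapter).
          # Triton's built-in counter nv_inference_queue_duration_us is cumulative,
          # so derive a per-request average before using it as a target.
          name: triton_queue_compute_input_duration_us
        target:
          type: AverageValue
          averageValue: "50000"   # microseconds: scale out when average queue time per pod exceeds 50 ms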
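
To see how an HPA turns a metric reading into a replica count, here is a minimal sketch of the documented rule, desired = ceil(currentReplicas × currentValue / targetValue), with the default 10% tolerance and clamping to the replica bounds. The function name and defaults are illustrative, and the real controller also accounts for unready pods and missing metrics, which this sketch ignores:

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation for an AverageValue target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        return current_replicas
    desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 80_000, 50_000))   # 7: queue time is 1.6x the target
print(desired_replicas(4, 52_000, 50_000))   # 4: within the 10% tolerance
print(desired_replicas(8, 200_000, 50_000))  # 20: capped at max_replicas
```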

Scale-to-zero (Knative, KEDA) reduces cost during off-peak hours, but the first request after an idle period pays for a cold start. GPU containers often take 30–120 seconds to start, load model weights, and reach readiness. Mitigate by keeping a minimum of 1 replica during business hours and scaling to zero only overnight, and only for batch-tolerant workloads.
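
The "warm during business hours, zero overnight" policy can be sketched as a function that returns the replica floor for a given time. This is only an illustration: the 08:00 to 20:00 window, the weekend handling, and the function name are assumptions, and in practice the result would be applied by whatever scheduler or autoscaler manages the minimum replica count:

```python
from datetime import datetime


def replica_floor(now: datetime, open_hour: int = 8, close_hour: int = 20) -> int:
    """Minimum replicas to keep warm: 1 in business hours, 0 otherwise."""
    is_weekday = now.weekday() < 5                  # Monday=0 ... Friday=4
    in_hours = open_hour <= now.hour < close_hour   # assumed business window
    return 1 if (is_weekday and in_hours) else 0


print(replica_floor(datetime(2024, 1, 10, 14, 0)))  # Wednesday afternoon -> 1
print(replica_floor(datetime(2024, 1, 10, 2, 0)))   # Wednesday overnight -> 0
```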

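Scale-down lag matters too. The downscale stabilization window defaults to 5 minutes; it is set cluster-wide with the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization, or per HPA with spec.behavior.scaleDown.stabilizationWindowSeconds in autoscaling/v2. Reduce it carefully: premature scale-down followed by a traffic spike causes latency spikes while new pods warm up.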
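
The stabilization window works by remembering recent replica recommendations and acting on the highest one, so a brief dip in load does not remove pods immediately. A minimal sketch of that idea, with a hypothetical class name and timestamps given in seconds; the real controller has additional rate-limit policies that are not modeled here:

```python
from collections import deque


class ScaleDownStabilizer:
    """Return the highest recommendation seen within the last window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended_replicas)

    def stabilize(self, now: float, recommendation: int) -> int:
        self._history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


s = ScaleDownStabilizer(window_seconds=300)
print(s.stabilize(0, 10))    # 10
print(s.stabilize(60, 4))    # 10: the earlier peak is still inside the window
print(s.stabilize(301, 4))   # 4: the peak has aged out, scale-down proceeds
```

A shorter window lets the last line happen sooner, which saves cost but raises the chance of scaling down just before the next burst.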

# Check current HPA status
kubectl get hpa model-server-hpa -o wide
# Output: the TARGETS column shows the current/target metric value
