datarekha
MLOps · Medium · Asked at Google, Amazon, Uber, Databricks, Seldon

How does autoscaling work for ML inference services, and what metrics should drive it?

The short answer

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads can leave the CPU near idle even under full load. The Kubernetes Horizontal Pod Autoscaler can be configured with custom metrics to do this. Scale-to-zero cuts cost but introduces cold starts, so pair it with a warm replica or warm-up buffer to avoid latency spikes.

How to think about it

Standard CPU-based autoscaling is the wrong signal for GPU inference. A model server process can sit at 5% CPU while its GPU is 95% saturated. Scaling on CPU utilization means the autoscaler sees no reason to add replicas until the service is already collapsing.
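To make the mismatch concrete, here is a minimal sketch of the replica calculation the Horizontal Pod Autoscaler documents, desired = ceil(currentReplicas × currentMetric / targetMetric), applied to both signals. The replica count, the 60% and 70% targets, and the function name are illustrative assumptions, and the real controller also applies a tolerance band that is omitted here.

```python
import math

def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """HPA-style rule: ceil(currentReplicas * current / target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical readings from a saturated GPU model server running 4 replicas.
cpu_decision = desired_replicas(4, current_value=5.0, target_value=60.0)   # CPU at 5%, target 60%
gpu_decision = desired_replicas(4, current_value=95.0, target_value=70.0)  # GPU at 95%, target 70%

print(cpu_decision, gpu_decision)
```

With these numbers the CPU signal asks for fewer replicas while the GPU signal asks for more, which is the failure mode described above.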

Right signals for ML autoscaling:

| Metric | Tool | When to use |
|---|---|---|
| GPU utilization (%) | DCGM Exporter + Prometheus | General GPU serving |
| Request queue depth | Triton / custom metric | Bursty traffic |
| Requests per second | Prometheus rate() | Predictable load |
| Token generation throughput | vLLM metrics | LLM serving |

Horizontal Pod Autoscaler (HPA) with custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
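        metric:
          # Name as exposed through the custom metrics API (for example by a
          # Prometheus Adapter rule); assumed here to be a per-pod average
          # queue time in microseconds, not a raw cumulative counter.
          name: triton_queue_compute_input_duration_us
        target:
          type: AverageValue
          averageValue: "50000"   # scale out when average queue time exceeds 50 ms (50,000 µs)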
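Triton reports queue time as cumulative counters, so a per-request average has to be derived from two scrapes before it can be compared with a 50 ms target. The sketch below shows that arithmetic. The field names mirror Triton's nv_inference_queue_duration_us and nv_inference_request_success counters as an assumption about which series feed the metric, and the sample values are made up. In practice a Prometheus rate() ratio would usually do this.

```python
from dataclasses import dataclass

@dataclass
class CounterSample:
    queue_duration_us: float  # cumulative queue time, e.g. nv_inference_queue_duration_us
    request_success: float    # cumulative successful requests, e.g. nv_inference_request_success

def avg_queue_time_us(prev: CounterSample, curr: CounterSample) -> float:
    """Average queue time per request between two scrapes.

    Returns 0.0 when no requests completed (or a counter reset occurred).
    """
    requests = curr.request_success - prev.request_success
    if requests <= 0:
        return 0.0
    return (curr.queue_duration_us - prev.queue_duration_us) / requests

# Hypothetical scrapes taken one interval apart.
prev = CounterSample(queue_duration_us=1_000_000, request_success=100)
curr = CounterSample(queue_duration_us=13_000_000, request_success=300)

avg = avg_queue_time_us(prev, curr)
print(avg, avg > 50_000)
```

Here 200 requests accumulated 12 seconds of queue time, an average of 60 ms per request, which would exceed the 50 ms target in the manifest above.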

Scale-to-zero (Knative, KEDA) reduces cost during off-peak hours, but the first requests after an idle period pay a cold-start penalty: GPU containers take 30–120 seconds to start, load model weights, and reach readiness. Mitigate this by keeping a minimum of 1 replica during business hours and scaling to zero only overnight, and only for batch-tolerant workloads.
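The schedule-based mitigation can be sketched as a small policy function that returns the replica floor for a given time, plus a rough estimate of how many requests pile up during one cold start. The 08:00 to 20:00 window, the weekday-only rule, and both function names are assumptions chosen for illustration, not settings from Knative or KEDA.

```python
from datetime import datetime

def min_replicas_for(now: datetime, business_start: int = 8, business_end: int = 20) -> int:
    """Replica floor: 1 on weekdays during business hours, otherwise 0 (scale-to-zero)."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = business_start <= now.hour < business_end
    return 1 if (is_weekday and in_hours) else 0

def cold_start_backlog(requests_per_second: float, startup_seconds: float) -> float:
    """Requests that arrive while a pod is still starting and loading weights."""
    return requests_per_second * startup_seconds

print(min_replicas_for(datetime(2024, 1, 3, 10)))  # Wednesday, mid-morning
print(min_replicas_for(datetime(2024, 1, 3, 2)))   # Wednesday, overnight
print(cold_start_backlog(5.0, 120.0))              # 5 req/s during a 120 s start
```

Even modest traffic builds a backlog of hundreds of requests during a two-minute start, which is why the floor of one replica is kept while interactive users are expected.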

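Scale-down lag matters too. The kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization (default 5 minutes) sets how long the autoscaler waits before acting on a lower replica recommendation, and with autoscaling/v2 the same window can be set per HPA via behavior.scaleDown.stabilizationWindowSeconds. Reduce it carefully: premature scale-down followed by a traffic spike causes latency spikes while new pods warm up.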
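The effect of the stabilization window can be sketched as follows: the autoscaler remembers recent replica recommendations and acts on the highest one still inside the window, so scale-up is immediate while scale-down waits. This is a simplified model with a hypothetical class name, not the controller's actual code.

```python
from collections import deque

class DownscaleStabilizer:
    """Keep recent recommendations and act on the highest one inside the window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now: float, desired: int) -> int:
        self._history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)

stabilizer = DownscaleStabilizer(window_seconds=300.0)
# Hypothetical trace: load drops after t=0, then spikes again at t=310.
trace = [(0, 10), (60, 4), (200, 4), (301, 4), (310, 12)]
results = [stabilizer.recommend(t, d) for t, d in trace]
print(results)
```

In this trace the replica count stays at 10 until the high recommendation ages out, then drops to 4, and jumps straight to 12 when load returns. A shorter window would have released the pods sooner and left fewer warm replicas for the spike.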

# Check current HPA status
kubectl get hpa model-server-hpa -o wide
# Output: the TARGETS column shows current/target metric values
