How does autoscaling work for ML inference services, and what metrics should drive it?
ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because a GPU-bound model server can leave the CPU near idle even at full load. The Kubernetes Horizontal Pod Autoscaler (HPA) can scale on custom metrics. Scale-to-zero cuts off-peak cost but introduces cold starts, so pair it with a warm minimum replica count whenever latency matters.
How to think about it
Standard CPU-based autoscaling is the wrong signal for GPU inference. A model server process can sit at 5% CPU while its GPU is 95% saturated. An HPA that targets CPU utilization therefore sees no reason to scale out, and the first sign of overload is queueing latency or failed requests.
Right signals for ML autoscaling:
| Metric | Tool | When to use |
|---|---|---|
| GPU utilization (%) | DCGM Exporter + Prometheus | General GPU serving |
| Request queue depth | Triton / custom metric | Bursty traffic |
| Requests per second | Prometheus rate() | Predictable load |
| Token generation throughput | vLLM metrics | LLM serving |
Horizontal Pod Autoscaler (HPA) with custom metrics:
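```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          # Custom metric name as exposed by your metrics adapter (e.g. prometheus-adapter).
          # Triton's own Prometheus metrics use the nv_ prefix, so this name must be mapped.
          name: triton_queue_compute_input_duration_us
        target:
          type: AverageValue
          averageValue: "50000" # microseconds: 50 ms average queue time triggers scale-out
```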
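The HPA can only read a per-pod metric that something publishes through the custom metrics API. One way to do that is a prometheus-adapter rule; the sketch below assumes Prometheus scrapes Triton's `nv_inference_queue_duration_us` and `nv_inference_request_success` counters and attaches `namespace` and `pod` labels to them. The 2-minute rate window and the mapping onto the metric name used in the HPA are illustrative choices, not something the original setup confirms.

```yaml
# prometheus-adapter rules config (sketch; label names and rate window are assumptions)
rules:
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^nv_inference_queue_duration_us$"
      as: "triton_queue_compute_input_duration_us"   # name referenced by the HPA
    # Average queue time per request in microseconds:
    # queue-time counter rate divided by successful-request counter rate.
    metricsQuery: >-
      sum(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
      /
      sum(rate(nv_inference_request_success{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
```

Both Triton series are cumulative counters, which is why the rule divides one rate by the other instead of exposing the raw value. With no traffic the ratio is undefined, so idle periods are better governed by `minReplicas` than by this metric.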
Scale-to-zero (Knative, KEDA) reduces cost during off-peak hours, but the first request after scaling to zero pays the full cold-start cost. GPU containers typically take 30–120 seconds to start, load model weights, and pass readiness checks. Mitigate this by keeping at least 1 replica during business hours and scaling to zero only overnight, for batch-tolerant workloads.
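The business-hours pattern above could be expressed with a KEDA `ScaledObject` that combines a cron trigger with a load trigger. This is a sketch under assumptions: the object name, time zone, Prometheus address, query, and thresholds are all hypothetical placeholders, and `inference_gateway_pending_requests` stands in for any metric that is measured upstream of the pods.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: model-server               # Deployment to scale
  minReplicaCount: 0                 # allow scale-to-zero outside the cron window
  maxReplicaCount: 20
  cooldownPeriod: 600                # seconds without active triggers before dropping to zero
  triggers:
    - type: cron                     # hold one warm replica on weekdays, 08:00-18:00
      metadata:
        timezone: Etc/UTC            # assumption: replace with your time zone
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "1"
    - type: prometheus               # add replicas beyond the warm one under load
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(inference_gateway_pending_requests)         # hypothetical upstream metric
        threshold: "10"              # illustrative target of pending requests per replica
```

The load trigger reads a metric measured outside the model server because per-pod metrics disappear once the Deployment is at zero replicas. KEDA creates and manages its own HPA for the target, so this would replace a hand-written HPA on the same Deployment, not sit alongside it.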
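Scale-down lag matters too. The HPA waits out a stabilization window (5 minutes by default) before removing pods. The window is set cluster-wide with the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization, or per HPA with spec.behavior.scaleDown.stabilizationWindowSeconds in autoscaling/v2. Shorten it carefully: premature scale-down followed by a traffic spike causes latency spikes while new pods warm up.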
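A per-HPA alternative to the cluster-wide flag is the `behavior` field of `autoscaling/v2`. The sketch below reuses the model-server HPA and adds asymmetric policies, scaling out fast and scaling in slowly; the specific window and policy values are illustrative assumptions to tune against your own warm-up times.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_queue_compute_input_duration_us
        target:
          type: AverageValue
          averageValue: "50000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
        - type: Percent
          value: 100                   # at most double the replica count
          periodSeconds: 60            # per 60-second period
    scaleDown:
      stabilizationWindowSeconds: 600  # act on the highest recommendation of the last 10 minutes
      policies:
        - type: Pods
          value: 1                     # remove at most one pod
          periodSeconds: 120           # per 120-second period
```

A scale-down window longer than the pod warm-up time means a short lull in traffic does not discard capacity that would take a minute or more to bring back.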
```shell
# Check current HPA status
kubectl get hpa model-server-hpa -o wide
# The TARGETS column shows the current metric value / target value
```