How does autoscaling work for ML inference services, and what metrics should drive it?

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads can leave the CPU nearly idle even under full load. The Horizontal Pod Autoscaler in Kubernetes can be configured with custom metrics to do this. Scale-to-zero saves money on idle services, but the first request after an idle period pays the full cold start, so keep a small warm buffer of ready replicas when latency spikes are unacceptable.

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic like deploy only if accuracy exceeds a threshold. Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.
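
To make the dependency-management point concrete, here is a minimal sketch of what an orchestrator does at its core: run steps in dependency order and retry the ones that fail. It uses only the standard library (graphlib, Python 3.9+). The step names, the run_dag helper, and the retry count are illustrative assumptions, not Airflow or Kubeflow APIs.

```python
from graphlib import TopologicalSorter

def run_dag(steps, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed step.

    steps: name -> zero-argument callable (hypothetical pipeline steps)
    deps:  name -> set of step names that must finish first
    """
    order = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                order.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: fail the whole run

    return order

calls = {"train": 0}

def flaky_train():
    # Fails once, then succeeds, to exercise the retry path.
    calls["train"] += 1
    if calls["train"] == 1:
        raise RuntimeError("transient failure")

steps = {
    "extract": lambda: None,
    "validate": lambda: None,
    "train": flaky_train,
    "evaluate": lambda: None,
}
deps = {
    "extract": set(),
    "validate": {"extract"},
    "train": {"validate"},
    "evaluate": {"train"},
}

order = run_dag(steps, deps)
print(order)
```

Chained cron jobs give you none of this: a cron entry has no idea whether the step before it succeeded, and a retry means waiting for the next scheduled tick.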

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
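
As a rough illustration of the unit-economics idea, the sketch below rolls tagged usage rows up into cost per 1,000 predictions. The rows, tag names, and dollar figures are invented for the example; in practice they would come from your cloud billing export joined with serving metrics.

```python
from collections import defaultdict

# Hypothetical billing rows; tags and figures are made up for illustration.
usage = [
    {"team": "growth", "model": "churn", "env": "prod", "cost_usd": 120.0, "predictions": 400_000},
    {"team": "growth", "model": "churn", "env": "staging", "cost_usd": 30.0, "predictions": 10_000},
    {"team": "search", "model": "ranker", "env": "prod", "cost_usd": 300.0, "predictions": 3_000_000},
]

def cost_per_1k_predictions(rows):
    """Aggregate by (team, model) and return cost per 1,000 predictions."""
    cost = defaultdict(float)
    preds = defaultdict(int)
    for r in rows:
        key = (r["team"], r["model"])
        cost[key] += r["cost_usd"]
        preds[key] += r["predictions"]
    # Skip models with no traffic to avoid dividing by zero.
    return {k: 1000 * cost[k] / preds[k] for k in cost if preds[k] > 0}

unit_costs = cost_per_1k_predictions(usage)
for (team, model), c in sorted(unit_costs.items()):
    print(f"{team}/{model}: ${c:.3f} per 1k predictions")
```

In this made-up data the ranker has the larger bill but the lower unit cost, which is the kind of distinction total spend alone hides.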

Walk me through the full ML lifecycle from problem definition to model retirement.

The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.

Just enough Kubernetes for an ML engineer — MLOps

The last lesson used words it never defined — pod, node, scheduler, a container per step — because Kubeflow’s whole job is to hide them so you can think in DAGs. But hidden is not gone: the day a step sits stuck in Pending or a GPU job won’t schedule, the abstraction evaporates and you are debugging raw Kubernetes whether you learned it or not. We asked for just enough of that layer to survive the moment the magic stops. This is it — and it opens, as these things always do, mid-incident.

Friday afternoon, your recommendation model is suddenly serving 8x normal traffic — a celebrity tweeted about your product. The platform team Slacks you: “your pods are at 95% GPU, latency is up to 2 seconds, scale up.” You stare at kubectl and wonder which of the seventeen YAML files in infra/k8s/ you’re supposed to edit.

Here’s the short version: you need to know four primitives to read and write the YAML that runs your service. Everything else is built on top of those four. Once they click, the rest is API surface.

The four primitives

Primitive	What it is	Why an ML engineer cares
Pod	One or more containers scheduled together on a node	The unit. Your model server is a container in a pod.
Deployment	A controller that keeps N pods running, rolling out new versions safely	Replicas, rolling updates, rollbacks
Service	A stable DNS name and IP that load-balances across the pods	“How does anything find my model?”
Ingress	HTTP routing from outside the cluster to services inside	Public URLs, TLS, path-based routing

Pods are mortal. They get killed, restarted, moved to new nodes. The Deployment makes sure there are always N of them. The Service hides their churn behind a stable name. The Ingress exposes that name to the outside world.
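
The “always N of them” behaviour is a control loop: compare the desired count with what is actually running, then fix the difference. The toy below is a sketch of that idea only. The reconcile function and pod names are hypothetical, and a real Deployment works through a ReplicaSet rather than a plain list.

```python
import itertools

_ids = itertools.count()

def reconcile(desired, running):
    """Return the pod list after one control-loop pass.

    running: names of pods currently alive. Missing pods are replaced
    with freshly named ones; extras are trimmed.
    """
    pods = list(running)
    while len(pods) < desired:
        pods.append(f"pod-{next(_ids)}")
    return pods[:desired]

pods = reconcile(3, [])             # initial rollout: three new pods
survivors = pods[1:]                # simulate one pod being killed
pods_after = reconcile(3, survivors)

print(pods)
print(pods_after)
```

Note that the replacement pod gets a new name. That churn is exactly what the Service exists to hide.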

That’s the whole topology of 90% of ML serving deployments.

[Figure: an ML serving deployment, showing all four primitives plus the HPA and the scheduler.]

A Deployment + Service for a FastAPI model server

Here’s a complete, production-shaped YAML for the FastAPI service from the previous lesson. GPU-requesting, with sensible probes and resource limits.

# k8s/churn-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-predictor
  labels:
    app: churn-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-predictor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # zero-downtime: start a new one before killing an old one
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
        - name: server
          image: ghcr.io/yourorg/churn-service:0.3.1   # never use :latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
              nvidia.com/gpu: 1       # scheduler will only place this on a GPU node
            limits:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: 1
          livenessProbe:
            httpGet: { path: /healthz, port: 8000 }
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /readyz, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 5
      imagePullSecrets:
        - name: ghcr-pull-secret
---
apiVersion: v1
kind: Service
metadata:
  name: churn-predictor
spec:
  selector:
    app: churn-predictor
  ports:
    - name: http
      port: 80
      targetPort: 8000
  type: ClusterIP    # ClusterIP: reachable only inside the cluster (the default)

Read it once, then we’ll pick the load-bearing parts.
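
One load-bearing part is the rollout strategy. The sketch below steps through what maxSurge: 1 and maxUnavailable: 0 imply for three replicas. It is a simplified model, not the controller’s real algorithm: it assumes every new pod passes its readiness probe instantly, and the function name is made up for the example.

```python
def rolling_update(replicas, max_surge=1, max_unavailable=0):
    """Return the (old, new) pod counts after each step of a simplified rollout.

    Simplifying assumption: a new pod becomes ready the moment it starts.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Start new pods, never exceeding replicas + maxSurge in total.
        start = min(replicas + max_surge - (old + new), replicas - new)
        new += max(start, 0)
        history.append((old, new))
        # Stop old pods, never dropping below replicas - maxUnavailable.
        stop = min(old, (old + new) - (replicas - max_unavailable))
        old -= max(stop, 0)
        history.append((old, new))
    return history

history = rolling_update(3)
for old, new in history:
    print(f"old={old} new={new} total={old + new}")
```

The total never drops below 3 and never exceeds 4. The price of zero downtime is that the cluster needs room for one extra pod during the rollout, which for this service means one spare GPU.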

What the scheduler actually does

When you run kubectl apply -f churn-deployment.yaml, the API server records your intent. The Deployment controller creates a ReplicaSet, which creates 3 Pod objects. Each Pod is Pending until the scheduler assigns it to a node.

The scheduler walks every node and asks: does this node have at least 500m CPU available, 1 GiB memory, and one free nvidia.com/gpu? If yes, the Pod can run there. If no, the Pod stays Pending — and if no node ever has capacity, you’ll watch your Pod sit there forever wondering why “K8s is broken.”

You can simulate the placement logic in plain Python. It’s not magic.

# A toy scheduler. It mimics the resource-fit check K8s applies when placing pods,
# minus node scoring and thousands of lines of edge-case handling.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Resources:
    cpu_millicores: int   # 1 CPU = 1000m
    memory_mib: int
    gpus: int = 0

@dataclass
class Node:
    name: str
    capacity: Resources
    used: Resources = field(default_factory=lambda: Resources(0, 0, 0))

    def has_room_for(self, req: Resources) -> bool:
        return (
            self.used.cpu_millicores + req.cpu_millicores <= self.capacity.cpu_millicores
            and self.used.memory_mib + req.memory_mib  <= self.capacity.memory_mib
            and self.used.gpus + req.gpus              <= self.capacity.gpus
        )

    def schedule(self, req: Resources):
        self.used = Resources(
            self.used.cpu_millicores + req.cpu_millicores,
            self.used.memory_mib    + req.memory_mib,
            self.used.gpus          + req.gpus,
        )

@dataclass
class Pod:
    name: str
    request: Resources

def schedule(pods: List[Pod], nodes: List[Node]) -> List[tuple]:
    placements = []
    for pod in pods:
        # Toy rule: pick the first node that fits. The real scheduler filters the
        # feasible nodes, scores them, and picks the highest-scoring one.
        chosen: Optional[Node] = next((n for n in nodes if n.has_room_for(pod.request)), None)
        if chosen is None:
            placements.append((pod.name, "Pending — no node has capacity"))
        else:
            chosen.schedule(pod.request)
            placements.append((pod.name, chosen.name))
    return placements

# A small cluster: 2 CPU nodes, 1 GPU node.
nodes = [
    Node("cpu-node-1", Resources(cpu_millicores=4000, memory_mib=16384, gpus=0)),
    Node("cpu-node-2", Resources(cpu_millicores=4000, memory_mib=16384, gpus=0)),
    Node("gpu-node-1", Resources(cpu_millicores=8000, memory_mib=32768, gpus=4)),
]

# A churn predictor (3 replicas) — needs a GPU each.
pods = [
    Pod(f"churn-{i}", Resources(cpu_millicores=500, memory_mib=1024, gpus=1))
    for i in range(3)
] + [
    # A second deployment — CPU-only
    Pod(f"feature-extractor-{i}", Resources(cpu_millicores=1000, memory_mib=2048))
    for i in range(2)
]

for pod, where in schedule(pods, nodes):
    print(f"{pod:30s} -> {where}")

churn-0                        -> gpu-node-1
churn-1                        -> gpu-node-1
churn-2                        -> gpu-node-1
feature-extractor-0            -> cpu-node-1
feature-extractor-1            -> cpu-node-1

The placement isn’t arbitrary: the resources.requests do the deciding. All three churn pods demand a GPU, and only gpu-node-1 has any, so all three land there (it has 4 GPUs, so one is left over). The two CPU-only feature-extractor pods fit on cpu-node-1, the first node in the list, so the toy never gets as far as the GPU node for them. (A real cluster would happily place a CPU-only pod on a GPU node if it scored well; taints and tolerations are the usual way to keep them off.) The takeaway the toy makes concrete: K8s will not squeeze a GPU pod onto a node without a free GPU, no matter how much CPU is idle there. resources.requests is a hard constraint, not a hint. Ask for a GPU that no node can spare and your pod sits in Pending until capacity appears, or forever if it never does, which is the exact mystery this five-minute model demystifies.

Services — the stable address

A Pod’s IP is not stable: whenever a pod is replaced (after a crash, a rollout, or a node drain), the new pod gets a new IP. A Service gives the set of pods (matched by a label selector) a stable, internal DNS name. From inside the cluster:

http://churn-predictor.default.svc.cluster.local:80/predict

resolves to the Service’s stable cluster IP, and each connection is load-balanced to one of the ready pods behind it. If you scale up to 10 replicas, the Service routes across all 10 as soon as each one passes its readiness probe, and you don’t change anything on the client side.
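
The selection rule is simple enough to mimic in a few lines. This sketch assumes a made-up list of pods with labels and a ready flag; round-robin stands in for the real balancing, which depends on how the cluster’s proxy is configured.

```python
import itertools

# Hypothetical pods; names, labels, and readiness are invented for the example.
pods = [
    {"name": "churn-0", "labels": {"app": "churn-predictor"}, "ready": True},
    {"name": "churn-1", "labels": {"app": "churn-predictor"}, "ready": False},  # still loading the model
    {"name": "feat-0", "labels": {"app": "feature-extractor"}, "ready": True},
    {"name": "churn-2", "labels": {"app": "churn-predictor"}, "ready": True},
]

def endpoints(selector, pods):
    """Names of ready pods whose labels include every selector pair."""
    return [
        p["name"]
        for p in pods
        if p["ready"] and all(p["labels"].get(k) == v for k, v in selector.items())
    ]

backends = endpoints({"app": "churn-predictor"}, pods)
rr = itertools.cycle(backends)  # round-robin as a stand-in for the real balancer
routed = [next(rr) for _ in range(4)]

print(backends)
print(routed)
```

churn-1 matches the selector but receives no traffic until it reports ready, which is why the readinessProbe in the Deployment matters for a model server that takes a while to load.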

For external traffic you front the Service with an Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: churn-predictor
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
    - host: churn.api.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: churn-predictor
                port: { number: 80 }
  tls:
    - hosts: [churn.api.yourcompany.com]
      secretName: churn-tls

Autoscaling on traffic — HPA

A HorizontalPodAutoscaler (HPA — “horizontal” means adding more pod replicas, as opposed to giving each pod more CPU/RAM) watches a metric and scales the Deployment’s replica count up or down. The metric is the signal: without a relevant one, the HPA can’t know when to act.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
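
The arithmetic behind that manifest is small enough to write down. The sketch below follows the formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. The function name and the sample readings are illustrative, and the sketch ignores stabilization windows, not-ready pods, and missing metrics.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=20, tolerance=0.1):
    """Simplified HPA decision for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave it alone
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))

scale_up = desired_replicas(3, current_value=95, target_value=70)
hold = desired_replicas(3, current_value=72, target_value=70)
scale_down = desired_replicas(10, current_value=20, target_value=70)
capped = desired_replicas(3, current_value=700, target_value=70)

print(scale_up, hold, scale_down, capped)
```

Because the replica count is multiplied by the ratio of current to target, the whole mechanism is only as good as the metric you feed it.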


Two failure modes to watch:

CPU isn’t the right signal for GPU-bound models. If inference spends 90% of its time on the GPU and 10% on the CPU, CPU utilization stays low while latency creeps up. Use a custom metric (request latency, queue depth, or per-pod RPS, exposed via Prometheus and the Prometheus Adapter).
HPA can’t scale faster than your model loads. If a pod takes 60 seconds to load a 4 GB model into GPU memory, the HPA can’t help you in a 30-second traffic spike. Combine with overprovisioning or pre-warming.
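
Both failure modes can be put into numbers. The sketch below uses invented readings from one overloaded GPU pod: it compares the scaling signal you would get from CPU with the one from queue depth, then estimates how many requests pile up while new pods are still loading the model. All names, targets, and traffic figures are assumptions for illustration.

```python
def scale_ratio(current, target):
    """Above 1 means 'add replicas', below 1 means 'remove replicas'."""
    return current / target

# Hypothetical readings from one overloaded GPU-bound pod.
cpu_ratio = scale_ratio(current=12, target=70)    # CPU utilization, percent
queue_ratio = scale_ratio(current=40, target=10)  # queued requests per pod

def backlog_during_warmup(arrival_rps, per_pod_rps, ready_pods, load_seconds):
    """Requests that accumulate while newly started pods are not yet ready."""
    capacity = per_pod_rps * ready_pods
    excess = max(arrival_rps - capacity, 0)
    return excess * load_seconds

backlog = backlog_during_warmup(
    arrival_rps=800, per_pod_rps=100, ready_pods=3, load_seconds=60
)

print(f"cpu says {cpu_ratio:.2f}x, queue depth says {queue_ratio:.2f}x")
print(f"backlog while new pods load: {backlog} requests")
```

With these numbers the CPU signal tells the HPA to scale down while the queue says to quadruple, and even a correct decision leaves tens of thousands of requests waiting on the model load. That second number is the argument for overprovisioning or pre-warming.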

Production patterns worth knowing

A few patterns that come up in every serious deployment:

Pod anti-affinity — spread replicas across nodes so a single node failure doesn’t kill the whole service.

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels: { app: churn-predictor }
          topologyKey: kubernetes.io/hostname

PodDisruptionBudget — prevents voluntary disruptions (cluster upgrades, node drains) from taking all your replicas down at once.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: churn-predictor }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: churn-predictor } }

imagePullSecrets — when your image is in a private registry (GHCR, ECR, GCR), you reference a Secret holding the credentials.
StatefulSets — like Deployments but pods get stable identities and persistent volumes. For ML this is mostly vector databases (Milvus, Qdrant, Weaviate self-hosted) and other stateful sidecars. Your stateless model server is a Deployment, not a StatefulSet.
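
Of these, the PodDisruptionBudget is the easiest to reason about with a toy. The sketch below applies the minAvailable: 2 rule from the manifest above to a node drain that wants to evict all three replicas. The function names are hypothetical and the model ignores pods that are already unhealthy.

```python
def can_evict(healthy_pods, min_available):
    """A voluntary eviction is allowed only if enough pods remain afterwards."""
    return healthy_pods - 1 >= min_available

def drain(healthy_pods, min_available, evictions_requested):
    """Apply evictions one by one; return how many were granted."""
    granted = 0
    for _ in range(evictions_requested):
        if not can_evict(healthy_pods, min_available):
            break  # the drain waits here until a replacement pod is healthy
        healthy_pods -= 1
        granted += 1
    return granted

granted = drain(healthy_pods=3, min_available=2, evictions_requested=3)
print(granted)
```

Only one eviction goes through. The rest of the drain blocks until the Deployment has brought a replacement up elsewhere, which is the behaviour you want during a cluster upgrade.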

What you should actually be writing

Here’s the part nobody puts in the K8s tutorial: most ML engineers don’t write raw Deployment YAML. They use:

Helm charts — parameterised YAML templates. You write values.yaml, not the Deployment.
KServe — a CRD that turns “I have a model and want to serve it” into a 20-line YAML, handling autoscaling, canaries, multi-model hosting, and GPU sharing.
Managed serverless K8s (GKE Autopilot, EKS Fargate, AKS virtual nodes backed by Azure Container Instances): you don’t manage nodes; you submit pods and pay for the resources they request while they run.
Cloud-native serving — SageMaker Endpoints, Vertex Endpoints, Azure ML Online Endpoints. Not K8s under the hood (from your perspective), but the same mental model.

When you actually do need raw K8s

To be fair to K8s: there are real reasons to drop down to the primitives.

You have cross-cutting requirements (specific node selectors, exotic GPU topologies, multi-tenancy isolation) that the higher-level abstractions don’t expose.
You’re building the platform that other teams use — i.e. you’re the platform team, or you’re packaging KServe / a Helm chart for others.
You’re integrating with cluster-specific infrastructure (service mesh, custom CNI, on-prem hardware) that a managed service doesn’t cover.

If none of those apply, ship to a managed endpoint or KServe and revisit when the abstraction actually limits you.

In one breath

Ninety percent of “we deploy on K8s” is four primitives plus an autoscaler: a Pod (the unit — your model server’s container), a Deployment (keeps N pods alive, rolls out new versions, rolls back), a Service (a stable DNS name load-balancing across the churning pods), and an Ingress (HTTP/TLS routing from outside in) — with resources.requests acting as a hard scheduling constraint (ask for a GPU no node has and the pod sits Pending forever) and an HPA scaling replicas on a metric, which must reflect real saturation (CPU is the wrong signal for a GPU-bound model); and the honest punchline is that most ML engineers shouldn’t hand-write any of this — KServe or a managed endpoint collapses it — but you must understand it to debug when it breaks.

Practice

Before the quiz, replay the scheduler output as a diagnosis. A teammate’s GPU training pod has been Pending for an hour and they swear “the cluster has tons of free CPU.” Using the hard-constraint idea the toy made concrete, explain in one sentence why free CPU is irrelevant and what single kubectl describe line would confirm it. Then the autoscaling trap: your GPU model server is melting under load at 2-second latency, but the HPA reports CPU at 12% and refuses to scale — what’s happening, and what metric should the HPA watch instead?

Quick check

Q1. You set `resources.requests.nvidia.com/gpu: 1` on a Pod, but no nodes in the cluster have GPUs. What happens?

Q2. Why does an HPA targeting CPU utilization sometimes fail to scale a GPU-bound model server even under heavy load?

Q3. You have one ML model to serve and no platform team. What's the most pragmatic choice?

A question to carry forward

You can now place a model server on a cluster and keep it alive under load. But step back and ask what that running pod actually does on each request — and you hit a gap this whole infrastructure chapter has stepped around. A request arrives with a user id. The model doesn’t want a user id; it wants features — this user’s purchases in the last 30 days, their average session length, their churn-risk percentile. Those numbers have to come from somewhere, computed and ready, in the few milliseconds before the model can answer.

And here is where two threads from far back in this course collide. The serving pod needs features computed exactly the way the training pipeline computed them — or you get the training-serving skew that silently wrecks a model. So the question to carry forward is the data half of serving: where do request-time features come from, how are they kept fresh, and how do you guarantee the feature a model sees in production is byte-identical to the one it learned from? That system — the offline/online split that answers all three — is the feature store, and it is the next lesson.
