datarekha

Just enough Kubernetes for an ML engineer

Pods, Deployments, Services, and Ingress: the four primitives that explain 90% of what 'we deploy on K8s' actually means, plus GPU scheduling and the HPA. And an honest take on how little raw YAML you should be writing today.

9 min read Advanced MLOps Lesson 16 of 17

What you'll learn

  • The four primitives — Pod, Deployment, Service, Ingress — and what each is responsible for
  • GPU requests via `resources.limits.nvidia.com/gpu` and how scheduling actually works
  • HorizontalPodAutoscaler for traffic-based scaling, and when it bites you
  • When to write raw YAML vs. reach for Helm, KServe, or managed (GKE Autopilot, EKS Fargate)

Before you start

Friday afternoon, your recommendation model is suddenly serving 8x normal traffic — a celebrity tweeted about your product. The platform team Slacks you: “your pods are at 95% GPU, latency is up to 2 seconds, scale up.” You stare at kubectl and wonder which of the seventeen YAML files in infra/k8s/ you’re supposed to edit.

Here’s the short version: you need to know four primitives to read and write the YAML that runs your service. Everything else is built on top of those four. Once they click, the rest is API surface.

The four primitives

| Primitive | What it is | Why an ML engineer cares |
| --- | --- | --- |
| Pod | One or more containers scheduled together on a node | The unit. Your model server is a container in a pod. |
| Deployment | A controller that keeps N pods running, rolling out new versions safely | Replicas, rolling updates, rollbacks |
| Service | A stable DNS name and IP that load-balances across the pods | “How does anything find my model?” |
| Ingress | HTTP routing from outside the cluster to services inside | Public URLs, TLS, path-based routing |

Pods are mortal. They get killed, restarted, moved to new nodes. The Deployment makes sure there are always N of them. The Service hides their churn behind a stable name. The Ingress exposes that name to the outside world.

That’s the whole topology of 90% of ML serving deployments.
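If “keeps N pods running” feels abstract, here is a minimal sketch of the reconcile idea in plain Python. It is not how the real Deployment controller is implemented; the `reconcile` function and the pod names are hypothetical and only illustrate the "compare desired with actual, then fix the difference" loop.

```python
import itertools

def reconcile(desired, running, new_names):
    """One reconcile pass: add or remove pods until the count matches `desired`."""
    pods = list(running)
    while len(pods) < desired:
        pods.append(next(new_names))   # start a replacement pod
    while len(pods) > desired:
        pods.pop()                     # scale down by removing a pod
    return pods

# Hypothetical pod names; real ones get a generated suffix.
names = (f"churn-predictor-{i}" for i in itertools.count())

pods = reconcile(3, [], names)          # initial rollout: three pods
pods.remove("churn-predictor-1")        # a node dies and takes one pod with it
pods = reconcile(3, pods, names)        # the controller notices and replaces it
print(pods)
```

The replacement pod has a new name (and, in a real cluster, a new IP), which is exactly why clients talk to the Service rather than to individual pods.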

[Figure: Client → HTTPS → Ingress (TLS + host) → Service (ClusterIP, load-balances) → three churn-pred Pods, each ready with nvidia.com/gpu: 1. The client sits outside the cluster. An HPA (min=2, max=20) watches a metric, and kube-scheduler (control plane) places pods on nodes that match resources.requests.]
An ML serving deployment, all four primitives plus HPA and the scheduler.

A Deployment + Service for a FastAPI model server

Here’s a complete, production-shaped YAML for the FastAPI service from the previous lesson. GPU-requesting, with sensible probes and resource limits.

# k8s/churn-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-predictor
  labels:
    app: churn-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-predictor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # zero-downtime: start a new one before killing an old one
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
        - name: server
          image: ghcr.io/yourorg/churn-service:0.3.1   # never use :latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
              nvidia.com/gpu: 1       # scheduler will only place this on a GPU node
            limits:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: 1
          livenessProbe:
            httpGet: { path: /healthz, port: 8000 }
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /readyz, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 5
      imagePullSecrets:
        - name: ghcr-pull-secret
---
apiVersion: v1
kind: Service
metadata:
  name: churn-predictor
spec:
  selector:
    app: churn-predictor
  ports:
    - name: http
      port: 80
      targetPort: 8000
  type: ClusterIP    # ClusterIP: reachable only inside the cluster (the default)

Read it once, then we’ll pick the load-bearing parts.
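One load-bearing part is the rollout strategy. The sketch below is a simplified model of a rolling update with `maxSurge` and `maxUnavailable`, assuming every new pod becomes ready before any old one is removed. The function name and the `(old, new)` step tuples are illustrative only, not anything Kubernetes exposes.

```python
def rolling_update(replicas, max_surge, max_unavailable):
    """Return the (old_pods, new_pods) counts after each step of a rollout."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Start new pods, but never exceed replicas + max_surge in total.
        can_start = min(replicas - new, replicas + max_surge - (old + new))
        new += can_start
        steps.append((old, new))
        # Remove old pods, but keep at least replicas - max_unavailable running.
        can_stop = min(old, (old + new) - (replicas - max_unavailable))
        old -= can_stop
        steps.append((old, new))
    return steps

steps = rolling_update(replicas=3, max_surge=1, max_unavailable=0)
for old, new in steps:
    print(f"old={old} new={new} total={old + new}")
```

With the values from the manifest, the total never drops below 3 and never exceeds 4. That is what "zero-downtime" means here, and it also means the rollout needs room for one extra GPU pod while it runs.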

What the scheduler actually does

When you run kubectl apply -f k8s/churn-deployment.yaml, the API server records your intent. The Deployment controller creates a ReplicaSet, which creates 3 Pod objects. Each Pod is Pending until the scheduler assigns it to a node.

The scheduler looks at each node and asks: after subtracting what the pods already placed there have requested, does this node still have at least 500m CPU, 1 GiB of memory, and one free nvidia.com/gpu? If yes, the Pod can run there. If no node qualifies, the Pod stays Pending, and if capacity never frees up, you’ll watch it sit there forever wondering why “K8s is broken.”

You can simulate the placement logic in plain Python. It’s not magic.
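Here is one way to do that: a minimal first-fit sketch, assuming a toy cluster of two single-GPU nodes and one CPU-only node. The real kube-scheduler filters nodes and then scores them; this sketch only models the filtering step, and the node names and capacities are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Node:
    name: str
    cpu_m: int    # unreserved CPU, in millicores
    mem_mi: int   # unreserved memory, in MiB
    gpus: int     # unreserved nvidia.com/gpu devices

@dataclass
class PodRequest:
    name: str
    cpu_m: int = 500     # resources.requests.cpu: "500m"
    mem_mi: int = 1024   # resources.requests.memory: "1Gi"
    gpus: int = 1        # resources.requests nvidia.com/gpu: 1

def fits(node: Node, pod: PodRequest) -> bool:
    """A pod fits only if every requested resource is still available."""
    return (node.cpu_m >= pod.cpu_m
            and node.mem_mi >= pod.mem_mi
            and node.gpus >= pod.gpus)

def schedule(pods: List[PodRequest], nodes: List[Node]) -> Dict[str, Optional[str]]:
    placements: Dict[str, Optional[str]] = {}
    for pod in pods:
        target = next((n for n in nodes if fits(n, pod)), None)
        if target is None:
            placements[pod.name] = None   # no node qualifies: the pod stays Pending
            continue
        # Reserve the requested resources on the chosen node.
        target.cpu_m -= pod.cpu_m
        target.mem_mi -= pod.mem_mi
        target.gpus -= pod.gpus
        placements[pod.name] = target.name
    return placements

nodes = [
    Node("gpu-node-a", cpu_m=4000, mem_mi=16384, gpus=1),
    Node("gpu-node-b", cpu_m=4000, mem_mi=16384, gpus=1),
    Node("cpu-node-c", cpu_m=8000, mem_mi=32768, gpus=0),
]
pods = [PodRequest(f"churn-predictor-{i}") for i in range(3)]
placements = schedule(pods, nodes)
for pod_name, node_name in placements.items():
    print(pod_name, "->", node_name or "Pending")
```

Three replicas, two GPUs: the third pod has plenty of CPU and memory available on every node and still goes nowhere.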

The takeaway: K8s will not run two GPU pods on one node if you only have one GPU there. Your resources.requests is a real, hard constraint — not a hint. Misjudge it and your pods stay Pending.

Services — the stable address

A Pod’s IP changes every time it’s restarted. A Service gives the set of pods (matched by a label selector) a stable, internal DNS name. From inside the cluster:

http://churn-predictor.default.svc.cluster.local:80/predict

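resolves to the Service’s stable cluster IP, and each connection is forwarded to one of the ready pods behind it. The Service load-balances. If you scale up to 10 replicas, the Service routes across all 10 as soon as they pass their readiness probes; you don’t change anything on the client side.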
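As a rough mental model, a Service is "a name plus a label selector plus a list of ready pod IPs." The sketch below assumes the default `cluster.local` cluster domain and uses made-up pod IPs; the random pick is a simplification of how traffic is actually spread across backends.

```python
import random

def service_dns(name, namespace="default", cluster_domain="cluster.local"):
    """In-cluster DNS name of a Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"

def ready_endpoints(pods, selector):
    """IPs of pods whose labels match the selector and that pass readiness."""
    return [
        p["ip"] for p in pods
        if p["ready"] and all(p["labels"].get(k) == v for k, v in selector.items())
    ]

# Hypothetical pods; IPs change whenever a pod is recreated.
pods = [
    {"ip": "10.0.1.7", "labels": {"app": "churn-predictor"}, "ready": True},
    {"ip": "10.0.2.4", "labels": {"app": "churn-predictor"}, "ready": False},  # still loading the model
    {"ip": "10.0.3.9", "labels": {"app": "other-service"}, "ready": True},
]

endpoints = ready_endpoints(pods, {"app": "churn-predictor"})
backend = random.choice(endpoints)
print(service_dns("churn-predictor"), "->", backend)
```

The pod that has not passed `/readyz` yet receives no traffic, which is why the readiness probe in the Deployment matters as much as the Service itself.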

For external traffic you front the Service with an Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: churn-predictor
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
    - host: churn.api.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: churn-predictor
                port: { number: 80 }
  tls:
    - hosts: [churn.api.yourcompany.com]
      secretName: churn-tls
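The `pathType: Prefix` rule matches on whole path segments, not on raw string prefixes, and the longest matching rule wins. A small sketch of that behaviour, assuming a hypothetical second rule for `/v2` that is not in the manifest above:

```python
def _segments(path):
    """Split a URL path into its non-empty segments."""
    return [part for part in path.split("/") if part]

def prefix_matches(rule_path, request_path):
    """Segment-wise prefix match, as used by pathType: Prefix."""
    rule = _segments(rule_path)
    return _segments(request_path)[:len(rule)] == rule

def route(rules, request_path):
    """Pick the backend of the longest matching rule, or None if nothing matches."""
    matches = [
        (len(_segments(path)), service)
        for path, service in rules.items()
        if prefix_matches(path, request_path)
    ]
    return max(matches)[1] if matches else None

# "/v2" and "churn-predictor-v2" are hypothetical, added to show precedence.
rules = {"/": "churn-predictor", "/v2": "churn-predictor-v2"}

print(route(rules, "/predict"))      # falls through to the "/" rule
print(route(rules, "/v2/predict"))   # the more specific "/v2" rule wins
print(prefix_matches("/predict", "/predictions"))  # different segment, no match
```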

Autoscaling on traffic — HPA

A HorizontalPodAutoscaler (HPA — “horizontal” means adding more pod replicas, as opposed to giving each pod more CPU/RAM) watches a metric and scales the Deployment’s replica count up or down. The metric is the signal: without a relevant one, the HPA can’t know when to act.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
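The core of the HPA's decision is one formula: desired replicas = ceil(current replicas × current metric / target metric), clamped to `minReplicas` and `maxReplicas`, with no change when the ratio is within a tolerance (10% by default). A sketch of that calculation, leaving out stabilization windows and scaling policies:

```python
import math

def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the HPA would aim for, ignoring stabilization and policies."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Values from the manifest above: target 70% CPU, between 2 and 20 replicas.
print(desired_replicas(3, 95, 70, 2, 20))    # overloaded: scale up
print(desired_replicas(3, 72, 70, 2, 20))    # within tolerance: stay put
print(desired_replicas(3, 20, 70, 2, 20))    # idle: scale down, but not below minReplicas
print(desired_replicas(10, 200, 70, 2, 20))  # heavy overload: capped at maxReplicas
```

Note that the formula only ever sees the metric you give it. If that metric does not move under load, neither does the replica count.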

Two failure modes to watch:

  1. CPU isn’t the right signal for GPU-bound models. If each request spends 90% of its time on the GPU and 10% on the CPU, CPU utilization stays low while latency creeps up. Use a custom metric instead: request latency, queue depth, or per-pod RPS exposed via Prometheus and the Prometheus Adapter.
  2. HPA can’t scale faster than your model loads. If a pod takes 60 seconds to load a 4 GB model into GPU memory, the HPA can’t help you in a 30-second traffic spike. Combine with overprovisioning or pre-warming.
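To put a rough number on the second failure mode, here is a back-of-the-envelope sketch. Every input is a hypothetical assumption: a 100 RPS baseline, 50 RPS per pod, 3 replicas, the 8x spike from the opening scenario, and 75 seconds before new pods serve traffic (about 15 seconds for the HPA to react plus the 60-second model load).

```python
import math

def replicas_needed(arrival_rps, per_pod_rps):
    """Replicas required to keep up with the arrival rate."""
    return math.ceil(arrival_rps / per_pod_rps)

def backlog(arrival_rps, per_pod_rps, replicas, seconds_until_ready):
    """Requests that pile up (or get dropped) before new capacity is serving."""
    excess_rps = max(0, arrival_rps - per_pod_rps * replicas)
    return excess_rps * seconds_until_ready

# All numbers below are illustrative assumptions, not measurements.
baseline_rps = 100
per_pod_rps = 50
replicas = 3
spike_rps = 8 * baseline_rps
seconds_until_ready = 15 + 60

needed = replicas_needed(spike_rps, per_pod_rps)
queued = backlog(spike_rps, per_pod_rps, replicas, seconds_until_ready)
print(f"need {needed} replicas; about {queued} requests arrive with nowhere to go")
```

Plugging in your own load time and per-pod throughput tells you how much headroom to keep warm.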

Production patterns worth knowing

A few terms that come up in every serious deployment:

  • Pod anti-affinity — spread replicas across nodes so a single node failure doesn’t kill the whole service.

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels: { app: churn-predictor }
              topologyKey: kubernetes.io/hostname
  • PodDisruptionBudget — prevents voluntary disruptions (cluster upgrades, node drains) from taking all your replicas down at once.

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata: { name: churn-predictor }
    spec:
      minAvailable: 2
      selector: { matchLabels: { app: churn-predictor } }
  • imagePullSecrets — when your image is in a private registry (GHCR, ECR, GCR), you reference a Secret holding the credentials.

  • StatefulSets — like Deployments but pods get stable identities and persistent volumes. For ML this is mostly vector databases (Milvus, Qdrant, Weaviate self-hosted) and other stateful sidecars. Your stateless model server is a Deployment, not a StatefulSet.
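The PodDisruptionBudget is the easiest of these to reason about numerically. With `minAvailable`, the number of pods a node drain may evict right now is the number of healthy pods minus `minAvailable`, floored at zero. A minimal sketch, with a hypothetical function name:

```python
def allowed_disruptions(healthy_pods, min_available):
    """How many pods a voluntary disruption may evict at this moment."""
    return max(0, healthy_pods - min_available)

# minAvailable: 2, as in the PodDisruptionBudget above.
print(allowed_disruptions(healthy_pods=3, min_available=2))  # one pod may be evicted
print(allowed_disruptions(healthy_pods=2, min_available=2))  # evictions are refused
```

One consequence worth checking in your own setup: if the HPA has scaled down to `minReplicas: 2` and the budget says `minAvailable: 2`, a node drain cannot evict either pod until another replica is running.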

What you should actually be writing

Here’s the part nobody puts in the K8s tutorial: most ML engineers don’t write raw Deployment YAML. They use:

  • Helm charts — parameterised YAML templates. You write values.yaml, not the Deployment.
  • KServe — a CRD that turns “I have a model and want to serve it” into a 20-line YAML, handling autoscaling, canaries, multi-model hosting, and GPU sharing.
  • Managed serverless K8s: GKE Autopilot, EKS Fargate, AKS virtual nodes (backed by Azure Container Instances). You don’t manage nodes; you submit pods and pay per pod-second.
  • Cloud-native serving — SageMaker Endpoints, Vertex Endpoints, Azure ML Online Endpoints. Not K8s under the hood (from your perspective), but the same mental model.

When you actually do need raw K8s

To be fair to K8s: there are real reasons to drop down to the primitives.

  • You have cross-cutting requirements (specific node selectors, exotic GPU topologies, multi-tenancy isolation) that the higher-level abstractions don’t expose.
  • You’re building the platform that other teams use — i.e. you’re the platform team, or you’re packaging KServe / a Helm chart for others.
  • You’re integrating with cluster-specific infrastructure (service mesh, custom CNI, on-prem hardware) that a managed service doesn’t cover.

If none of those apply, ship to a managed endpoint or KServe and revisit when the abstraction actually limits you.

Quick check

Q1. You set `resources.requests.nvidia.com/gpu: 1` on a Pod, but no nodes in the cluster have GPUs. What happens?
Q2. Why does an HPA targeting CPU utilization sometimes fail to scale a GPU-bound model server even under heavy load?
Q3. You have one ML model to serve and no platform team. What's the most pragmatic choice?

Next

You can serve a model on K8s. The next lesson is about the data half — feature stores, the offline/online split, and why your training features and serving features quietly drift apart.

Practice this in an interview

How does autoscaling work for ML inference services, and what metrics should drive it?

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads keep CPU near-idle even under full load. Horizontal Pod Autoscaler in Kubernetes can be configured with custom metrics, and if you also use scale-to-zero, keeping a warm-up buffer of ready replicas limits cold-start latency spikes.

Walk me through the full ML lifecycle from problem definition to model retirement.

The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.

How do Docker and ONNX complement each other for packaging and deploying ML models portably?

Docker encapsulates the full runtime environment — OS libraries, Python version, system packages — so the model runs identically everywhere. ONNX provides a hardware- and framework-agnostic model format so a model trained in PyTorch can be executed by a high-performance runtime like ONNX Runtime without the training framework as a dependency.

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate not just code correctness but also model quality — automated retraining triggers, data validation, model evaluation gates, and canary deployment checks that standard software pipelines have no equivalent for. A regression in model AUC is as much a deployment failure as a 500 error.
