Just enough Kubernetes for an ML engineer
Pods, Deployments, Services, and Ingress, plus GPU scheduling and the HPA: the four primitives and two mechanisms that explain 90% of what 'we deploy on K8s' actually means. Plus an honest take on how little raw YAML you should be writing today.
What you'll learn
- The four primitives — Pod, Deployment, Service, Ingress — and what each is responsible for
- GPU requests via `resources.limits.nvidia.com/gpu` and how scheduling actually works
- HorizontalPodAutoscaler for traffic-based scaling, and when it bites you
- When to write raw YAML vs. reach for Helm, KServe, or managed (GKE Autopilot, EKS Fargate)
Before you start
Friday afternoon, your recommendation model is suddenly serving 8x
normal traffic — a celebrity tweeted about your product. The platform
team Slacks you: “your pods are at 95% GPU, latency is up to 2 seconds,
scale up.” You stare at `kubectl` and wonder which of the seventeen
YAML files in `infra/k8s/` you’re supposed to edit.
Here’s the short version: you need to know four primitives to read and write the YAML that runs your service. Everything else is built on top of those four. Once they click, the rest is API surface.
The four primitives
| Primitive | What it is | Why an ML engineer cares |
|---|---|---|
| Pod | One or more containers scheduled together on a node | The unit. Your model server is a container in a pod. |
| Deployment | A controller that keeps N pods running, rolling out new versions safely | Replicas, rolling updates, rollbacks |
| Service | A stable DNS name and IP that load-balances across the pods | “How does anything find my model?” |
| Ingress | HTTP routing from outside the cluster to services inside | Public URLs, TLS, path-based routing |
Pods are mortal. They get killed, restarted, moved to new nodes. The Deployment makes sure there are always N of them. The Service hides their churn behind a stable name. The Ingress exposes that name to the outside world.
That’s the whole topology of 90% of ML serving deployments.
A Deployment + Service for a FastAPI model server
Here’s a complete, production-shaped YAML for the FastAPI service from the previous lesson. GPU-requesting, with sensible probes and resource limits.
# k8s/churn-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-predictor
  labels:
    app: churn-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-predictor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0   # zero-downtime: start a new one before killing an old one
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
        - name: server
          image: ghcr.io/yourorg/churn-service:0.3.1   # never use :latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
              nvidia.com/gpu: 1   # scheduler will only place this on a GPU node
            limits:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: 1
          livenessProbe:
            httpGet: { path: /healthz, port: 8000 }
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /readyz, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 5
      imagePullSecrets:
        - name: ghcr-pull-secret
---
apiVersion: v1
kind: Service
metadata:
  name: churn-predictor
spec:
  selector:
    app: churn-predictor
  ports:
    - name: http
      port: 80
      targetPort: 8000
  type: ClusterIP   # ClusterIP: reachable only inside the cluster (the default)
Read it once, then we’ll pick the load-bearing parts.
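The `maxSurge: 1` and `maxUnavailable: 0` pair in the manifest bounds how many pods exist at any moment during a rollout. A minimal sketch of that arithmetic, assuming both values are plain integers (Kubernetes also accepts percentages, which this ignores); `rollout_bounds` is a hypothetical helper, not a Kubernetes API:

```python
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int) -> tuple[int, int]:
    """Return (min_available, max_total) pod counts during a RollingUpdate."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    min_available = replicas - max_unavailable  # pods guaranteed to keep serving
    max_total = replicas + max_surge            # old + new pods allowed at once
    return min_available, max_total


print(rollout_bounds(replicas=3, max_surge=1, max_unavailable=0))  # (3, 4)
```

One consequence for GPU workloads: with these settings the rollout briefly needs a fourth pod, so it also needs a fourth free GPU. If the cluster has exactly three, the surge pod stays Pending and the rollout stalls.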
What the scheduler actually does
When you run `kubectl apply -f churn-deployment.yaml`, the API server
records your intent. The Deployment controller creates a ReplicaSet,
which creates 3 Pod objects. Each Pod is Pending until the scheduler
assigns it to a node.
The scheduler checks every node and asks: after subtracting the requests
of the pods already placed there, does this node still have at least
500m CPU, 1 GiB memory, and one free `nvidia.com/gpu`? If yes, the Pod
can run there. If no node qualifies, the Pod stays Pending, and if
capacity never frees up it will sit there forever while you wonder why
“K8s is broken.”
You can simulate the placement logic in plain Python. It’s not magic.
The takeaway: K8s will not run two GPU pods on one node if you only
have one GPU there (unless the cluster is explicitly configured for GPU
sharing, such as time-slicing or MIG). Your `resources.requests` is a
real, hard constraint, not a hint. Misjudge it and your pods stay Pending.
Services — the stable address
A Pod’s IP changes whenever the Pod is replaced, which happens on every rollout, eviction, or node failure. A Service gives the set of pods (matched by a label selector) a stable, internal DNS name. From inside the cluster:
http://churn-predictor.default.svc.cluster.local:80/predict
resolves to the Service’s stable cluster IP, and the Service forwards each connection to one of the ready pods behind it. If you scale up to 10 replicas, the Service routes across all 10 as soon as each one passes its readiness probe; you don’t change anything on the client side.
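The label selector is what ties a Service to its pods. A rough sketch of the matching rule, assuming a non-empty equality-based selector like the `app: churn-predictor` one above; the pod names and the `selects` helper are hypothetical:

```python
def selects(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """A pod matches when it carries every key/value pair in the selector."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical pods; extra labels on a pod do not prevent a match.
pods = {
    "churn-predictor-7d9f-abcde": {"app": "churn-predictor", "pod-template-hash": "7d9f"},
    "churn-predictor-7d9f-fghij": {"app": "churn-predictor", "pod-template-hash": "7d9f"},
    "fraud-detector-55c8-klmno": {"app": "fraud-detector", "pod-template-hash": "55c8"},
}

backends = [name for name, labels in pods.items()
            if selects({"app": "churn-predictor"}, labels)]
print(backends)  # ['churn-predictor-7d9f-abcde', 'churn-predictor-7d9f-fghij']
```

This is also why a typo in a label is a classic outage: the Service still exists and still resolves, but it matches zero pods and every request fails.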
For external traffic you front the Service with an Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: churn-predictor
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
    - host: churn.api.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: churn-predictor
                port: { number: 80 }
  tls:
    - hosts: [churn.api.yourcompany.com]
      secretName: churn-tls
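`pathType: Prefix` matches whole path segments, not raw string prefixes, which matters once you route more than `/`. A small sketch of the documented semantics; `prefix_matches` is a hypothetical helper, and a given ingress controller may add its own behaviour on top:

```python
def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match on '/'-separated segments (case-sensitive)."""
    rule = [segment for segment in rule_path.split("/") if segment]
    request = [segment for segment in request_path.split("/") if segment]
    # The rule matches when its segments are a leading run of the request's.
    return request[:len(rule)] == rule


print(prefix_matches("/", "/predict"))             # True: "/" matches every path
print(prefix_matches("/v1", "/v1/predict"))        # True
print(prefix_matches("/v1", "/v1beta/predict"))    # False: "v1beta" is a different segment
```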
Autoscaling on traffic — HPA
A HorizontalPodAutoscaler (HPA — “horizontal” means adding more pod replicas, as opposed to giving each pod more CPU/RAM) watches a metric and scales the Deployment’s replica count up or down. The metric is the signal: without a relevant one, the HPA can’t know when to act.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
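The core of the HPA's decision is one formula: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min and max. A simplified sketch using the values from the manifest above; it models the default 10% tolerance but ignores stabilization windows and not-yet-ready pods, and `desired_replicas` is a hypothetical name:

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float = 70,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA scaling rule for a single utilization metric."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling action
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(current=3, current_util=95))  # 5
print(desired_replicas(current=3, current_util=72))  # 3 (within tolerance)
print(desired_replicas(current=3, current_util=20))  # 2 (clamped to minReplicas)
```

Note that CPU utilization here is measured against each pod's CPU request (`500m` in the Deployment), not its limit, so a generous limit with a small request makes the HPA scale earlier than you might expect.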
Two failure modes to watch:
- CPU isn’t the right signal for GPU-bound models. If your model sits 90% on the GPU and 10% on the CPU, CPU utilization stays low while latency creeps up. Use a custom metric (request latency, queue depth, or per-pod RPS via Prometheus + the Prometheus Adapter).
- HPA can’t scale faster than your model loads. If a pod takes 60 seconds to load a 4 GB model into GPU memory, the HPA can’t help you in a 30-second traffic spike. Combine with overprovisioning or pre-warming.
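For the second failure mode, overprovisioning comes down to simple arithmetic: keep enough replicas running to absorb the spike you expect to arrive before a new pod can finish loading. A back-of-the-envelope sketch; the traffic numbers and the `replicas_with_headroom` helper are illustrative assumptions, not measurements:

```python
import math


def replicas_with_headroom(baseline_rps: float, per_pod_rps: float,
                           spike_factor: float) -> int:
    """Replicas needed to serve baseline_rps * spike_factor with no new pods."""
    return math.ceil(baseline_rps * spike_factor / per_pod_rps)


# Assumed: 120 RPS baseline, each pod sustains 50 RPS, traffic can double
# within the 60 seconds a new pod needs to load its model.
print(replicas_with_headroom(baseline_rps=120, per_pod_rps=50, spike_factor=2.0))  # 5
```

The result is a reasonable candidate for the HPA's `minReplicas`: the autoscaler then handles growth beyond the spike you planned for, while the standing replicas cover the window in which it cannot react.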
Production patterns worth knowing
A few terms that come up in every serious deployment:
- Pod anti-affinity: spread replicas across nodes so a single node failure doesn’t kill the whole service.

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels: { app: churn-predictor }
                topologyKey: kubernetes.io/hostname

- PodDisruptionBudget: prevents voluntary disruptions (cluster upgrades, node drains) from taking all your replicas down at once.

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata: { name: churn-predictor }
      spec:
        minAvailable: 2
        selector: { matchLabels: { app: churn-predictor } }

- `imagePullSecrets`: when your image is in a private registry (GHCR, ECR, GCR), you reference a Secret holding the credentials.
- StatefulSets: like Deployments but pods get stable identities and persistent volumes. For ML this is mostly vector databases (Milvus, Qdrant, Weaviate self-hosted) and other stateful sidecars. Your stateless model server is a Deployment, not a StatefulSet.
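Returning to the PodDisruptionBudget: its effect is easiest to see as a subtraction. A minimal sketch, assuming an integer `minAvailable` (percentages are also allowed but not handled here); `allowed_disruptions` is a hypothetical helper:

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be voluntarily evicted right now under the budget."""
    return max(0, healthy_pods - min_available)


print(allowed_disruptions(healthy_pods=3, min_available=2))  # 1
print(allowed_disruptions(healthy_pods=2, min_available=2))  # 0
```

The second case is worth noticing: with `minAvailable: 2` and an HPA whose `minReplicas` is also 2, a quiet period leaves zero allowed disruptions, and a node drain will block until another replica becomes healthy.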
What you should actually be writing
Here’s the part nobody puts in the K8s tutorial: most ML engineers don’t write raw Deployment YAML. They use:
- Helm charts: parameterised YAML templates. You write `values.yaml`, not the Deployment.
- KServe: a CRD that turns “I have a model and want to serve it” into a 20-line YAML, handling autoscaling, canaries, multi-model hosting, and GPU sharing.
- Managed serverless K8s: GKE Autopilot, EKS Fargate, AKS virtual nodes (backed by Azure Container Instances). You don’t manage nodes; you submit pods and pay per pod-second.
- Cloud-native serving: SageMaker Endpoints, Vertex Endpoints, Azure ML Online Endpoints. Not K8s under the hood (from your perspective), but the same mental model.
When you actually do need raw K8s
To be fair to K8s: there are real reasons to drop down to the primitives.
- You have cross-cutting requirements (specific node selectors, exotic GPU topologies, multi-tenancy isolation) that the higher-level abstractions don’t expose.
- You’re building the platform that other teams use — i.e. you’re the platform team, or you’re packaging KServe / a Helm chart for others.
- You’re integrating with cluster-specific infrastructure (service mesh, custom CNI, on-prem hardware) that a managed service doesn’t cover.
If none of those apply, ship to a managed endpoint or KServe and revisit when the abstraction actually limits you.
Next
You can serve a model on K8s. The next lesson is about the data half — feature stores, the offline/online split, and why your training features and serving features quietly drift apart.