Kubeflow Pipelines — ML workflows as Kubernetes-native DAGs
KFP v2 turns a Python function into a containerised pipeline step with typed inputs, output artifacts, and metadata tracking. It's powerful — and it's overkill more often than vendors will admit.
What you'll learn
- The KFP DSL — `@dsl.component` and `@dsl.pipeline` decorators in v2
- Typed artifacts — `Dataset`, `Model`, `Metrics` — and why they beat plain file paths
- Compiling a pipeline to YAML and submitting it to a cluster
- When Kubeflow earns its weight vs. when MLflow + Argo (or Dagster) wins
Before you start
Your fraud-detection retrain is a Makefile that calls four Python
scripts in order, scheduled by a Jenkins cron that’s been running since
2022. Last month the cluster’s GPU node went down mid-job and the
Makefile didn’t know to retry. Last week a teammate asked “which dataset
trained the model currently in prod?” and the answer was “let me grep
the Jenkins logs.” The data engineer wants Airflow. The platform team
wants Kubeflow. Someone on Slack just said “Dagster.”
This is the lesson where you figure out what Kubeflow actually buys you, and — equally important — when the answer is “less than the brochure suggests.”
The mental model
Kubeflow Pipelines (KFP) v2 is one thing: a way to express an ML workflow as a Kubernetes-native DAG, where every step is its own container, every input and output is a typed artifact tracked by the metadata store, and the whole graph compiles to a YAML file you submit to a cluster.
Compare with Airflow:
| | Airflow | Kubeflow Pipelines |
|---|---|---|
| Designed for | General-purpose data orchestration | ML pipelines specifically |
| Step unit | Python operator (or a KubernetesPodOperator) | A containerised component, always |
| Artifact tracking | You roll your own (XCom is too small) | Built-in — every output is versioned + lineage-tracked |
| Metadata store | None for ML artifacts (its metadata database tracks DAG and task state) | ML Metadata (MLMD): query “which model came from which dataset” |
| Reusability | DAG-shaped Python | Components are first-class, sharable across pipelines |
| K8s required | No | Yes — and an opinionated install at that |
The thing Kubeflow is good at is the artifact + metadata story. The thing it costs you is the operational burden of running Kubeflow itself.
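To make the metadata story concrete, here is a minimal sketch of a lineage lookup using Python's built-in `sqlite3`. The two-table schema, the example URIs, and the `datasets_behind` helper are all hypothetical; this is not MLMD's actual schema or API. The point is only that recording which artifact fed which step turns “which dataset trained this model?” into a query rather than a log search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artifacts (id INTEGER PRIMARY KEY, kind TEXT, uri TEXT);
    -- One row per "this output was produced from that input" edge.
    CREATE TABLE lineage (output_id INTEGER, input_id INTEGER);
    """
)
# Hypothetical artifacts: two datasets and the model trained on each.
conn.executemany(
    "INSERT INTO artifacts VALUES (?, ?, ?)",
    [
        (1, "Dataset", "s3://example-bucket/data/2024-01.parquet"),
        (2, "Dataset", "s3://example-bucket/data/2024-02.parquet"),
        (3, "Model", "s3://example-bucket/models/a.joblib"),
        (4, "Model", "s3://example-bucket/models/b.joblib"),
    ],
)
conn.executemany("INSERT INTO lineage VALUES (?, ?)", [(3, 1), (4, 2)])


def datasets_behind(model_uri: str) -> list[str]:
    """Return the URIs of the datasets a given model was trained on."""
    rows = conn.execute(
        """
        SELECT d.uri
        FROM artifacts AS m
        JOIN lineage AS l ON l.output_id = m.id
        JOIN artifacts AS d ON d.id = l.input_id
        WHERE m.uri = ? AND m.kind = 'Model' AND d.kind = 'Dataset'
        """,
        (model_uri,),
    ).fetchall()
    return [uri for (uri,) in rows]


print(datasets_behind("s3://example-bucket/models/b.joblib"))
```

A real metadata store adds executions, run context, and versioning on top of this, but the query shape is the same: walk edges from the model back to its inputs.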
The KFP v2 DSL
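A KFP component is a Python function decorated with `@dsl.component`.
For a lightweight component like the ones below, KFP runs the function
inside the `base_image` container as a pod, installs
`packages_to_install` when the pod starts, and routes its inputs and
outputs through the metadata store.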
```python
# pipeline.py: a real 3-step training pipeline
from kfp import dsl
from kfp.dsl import Dataset, Model, Metrics, Input, Output


@dsl.component(
    base_image="python:3.12-slim",
    # pyarrow is the Parquet engine pandas needs for to_parquet().
    packages_to_install=["scikit-learn==1.6.0", "pandas==2.2.3", "pyarrow==18.1.0"],
)
def load_data(out_dataset: Output[Dataset]):
    """Pull rows, write a Parquet to the artifact path KFP gives us."""
    # Imports live inside the function: only the body ships to the container.
    import pandas as pd
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
    df["label"] = y
    df.to_parquet(out_dataset.path)
    out_dataset.metadata["rows"] = len(df)
    out_dataset.metadata["features"] = 10


@dsl.component(
    base_image="python:3.12-slim",
    # pyarrow is needed here too, for read_parquet().
    packages_to_install=[
        "scikit-learn==1.6.0", "pandas==2.2.3", "pyarrow==18.1.0", "joblib==1.4.2",
    ],
)
def train(
    dataset: Input[Dataset],
    out_model: Output[Model],
    n_estimators: int = 200,
):
    """Fit a random forest on the dataset artifact and save it as a model artifact."""
    import pandas as pd, joblib
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_parquet(dataset.path)
    X = df.drop(columns=["label"]).values
    y = df["label"].values
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0).fit(X, y)
    joblib.dump(clf, out_model.path)
    out_model.metadata["framework"] = "scikit-learn"
    out_model.metadata["n_estimators"] = n_estimators


@dsl.component(
    base_image="python:3.12-slim",
    packages_to_install=[
        "scikit-learn==1.6.0", "pandas==2.2.3", "pyarrow==18.1.0", "joblib==1.4.2",
    ],
)
def evaluate(
    dataset: Input[Dataset],
    model: Input[Model],
    out_metrics: Output[Metrics],
):
    """Score the model and log metrics to the Metrics artifact."""
    import pandas as pd, joblib
    from sklearn.metrics import f1_score, accuracy_score

    # Note: this scores the same rows the model was trained on, so the
    # numbers are optimistic. A real pipeline would use a held-out split.
    df = pd.read_parquet(dataset.path)
    X = df.drop(columns=["label"]).values
    y = df["label"].values
    clf = joblib.load(model.path)
    pred = clf.predict(X)
    out_metrics.log_metric("f1", float(f1_score(y, pred)))
    out_metrics.log_metric("accuracy", float(accuracy_score(y, pred)))


@dsl.pipeline(
    name="churn-training-v2",
    description="Load → train → evaluate, with typed artifacts.",
)
def churn_pipeline(n_estimators: int = 200):
    # Each call adds a node to the graph; outputs are references, not values.
    data = load_data()
    trained = train(dataset=data.outputs["out_dataset"], n_estimators=n_estimators)
    evaluate(
        dataset=data.outputs["out_dataset"],
        model=trained.outputs["out_model"],
    )
```
Five things to notice:
- Each `@dsl.component` is a container. KFP runs it in the `base_image` you name (or an image you pre-built) as a Kubernetes pod, mounts in your inputs, and collects your outputs. Running each step in its own container gives you three things: step-level retry (a failed training step can be retried without re-running data loading), independent resource requests (training can request a GPU; evaluation uses CPU only), and reproducibility (the exact image is recorded with the run).
- `Output[Dataset]` etc. are typed artifacts. KFP gives the component a `.path` to write to. You never construct paths yourself: KFP generates them, and the URI gets recorded in the metadata store.
- `out_dataset.metadata["rows"] = ...` is searchable later. You can query MLMD: “show me every model trained on a dataset with rows > 10000.”
- The pipeline function is pure DSL. It’s not running the code; it’s building a graph. `data.outputs["out_dataset"]` is a reference, not a value.
- Hyperparameters become pipeline parameters. `n_estimators` is a knob you flip at submit time, not a code edit.
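The “reference, not a value” point can be illustrated without KFP. The sketch below is a toy deferred-execution graph in plain Python; `Task`, `OutputRef`, `step`, and `run` are hypothetical names invented for this illustration and say nothing about KFP's internals. Calling a step records a node and returns a handle, and nothing executes until the graph is run.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class OutputRef:
    """A placeholder for a value that will exist only after the task runs."""
    task: "Task"


@dataclass
class Task:
    name: str
    func: Callable
    inputs: dict = field(default_factory=dict)

    @property
    def output(self) -> OutputRef:
        return OutputRef(self)


GRAPH: list[Task] = []


def step(func: Callable) -> Callable:
    """Calling a decorated function records a Task instead of running it."""
    def record(**kwargs) -> Task:
        task = Task(func.__name__, func, kwargs)
        GRAPH.append(task)
        return task
    return record


@step
def load():
    return [1, 2, 3]


@step
def total(values):
    return sum(values)


def run(graph: list[Task]) -> dict:
    """Execute tasks in recorded order, swapping references for real values."""
    results = {}
    for task in graph:
        kwargs = {
            key: results[val.task.name] if isinstance(val, OutputRef) else val
            for key, val in task.inputs.items()
        }
        results[task.name] = task.func(**kwargs)
    return results


# Building the graph: no step body has executed yet.
data = load()
summed = total(values=data.output)
print(type(data.output).__name__)  # OutputRef, not a list
print(run(GRAPH)["total"])         # 6
```

This is why ordinary Python control flow on a step's output does not work inside a pipeline function: at graph-building time there is no value to inspect, only a handle.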
Compile, then submit
The DSL compiles to a YAML file. That YAML is the artifact your cluster runs. You can check it into git, diff it across versions, and submit it from anywhere with cluster credentials.
```python
# Compile to YAML: this is what your CI commits.
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=churn_pipeline,
    package_path="churn_pipeline.yaml",
)
```

```python
# Submit a run (from a machine with cluster access)
from kfp.client import Client

client = Client(host="https://kubeflow.yourcompany.com/pipeline")
client.create_run_from_pipeline_package(
    "churn_pipeline.yaml",
    arguments={"n_estimators": 300},
    experiment_name="churn-experiments",
)
```
That YAML is also how you schedule recurring runs (KFP has its own recurring-run mechanism), trigger from a CI job, or feed into a metadata-driven UI.
A runnable simulation
Since you can’t run KFP in the browser, here’s the same DAG executed inline — same component boundaries, same artifact handoff, no Kubernetes. It’s the cheapest way to validate your component logic before submitting to a cluster (where a bad component costs you a 30-second pod spin-up to find out).
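A minimal sketch of such an inline run, using only the standard library. The `Artifact` class, the JSON files, and the toy threshold “model” are stand-ins chosen here so the sketch runs anywhere; they are not KFP types and not the scikit-learn logic from the pipeline above. What the sketch preserves is the shape: three functions that communicate only through artifact paths and metadata.

```python
import json
import random
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Artifact:
    """Hypothetical stand-in for a KFP artifact: a file path plus metadata."""
    path: str
    metadata: dict = field(default_factory=dict)


def load_data(out_dataset: Artifact) -> None:
    rng = random.Random(0)
    rows = []
    for _ in range(500):
        label = rng.randint(0, 1)
        # One noisy feature whose mean depends on the label.
        rows.append({"f0": rng.gauss(2.0 * label, 1.0), "label": label})
    Path(out_dataset.path).write_text(json.dumps(rows))
    out_dataset.metadata["rows"] = len(rows)


def train(dataset: Artifact, out_model: Artifact) -> None:
    rows = json.loads(Path(dataset.path).read_text())
    # Toy "model": a threshold halfway between the two class means.
    means = {}
    for cls in (0, 1):
        values = [r["f0"] for r in rows if r["label"] == cls]
        means[cls] = sum(values) / len(values)
    threshold = (means[0] + means[1]) / 2
    Path(out_model.path).write_text(json.dumps({"threshold": threshold}))
    out_model.metadata["framework"] = "toy-threshold"


def evaluate(dataset: Artifact, model: Artifact, out_metrics: Artifact) -> None:
    rows = json.loads(Path(dataset.path).read_text())
    threshold = json.loads(Path(model.path).read_text())["threshold"]
    correct = sum((r["f0"] > threshold) == bool(r["label"]) for r in rows)
    out_metrics.metadata["accuracy"] = correct / len(rows)


def run_pipeline(workdir: str) -> dict:
    """Run the three steps in dependency order, handing artifacts along."""
    root = Path(workdir)
    data = Artifact(str(root / "dataset.json"))
    model = Artifact(str(root / "model.json"))
    metrics = Artifact(str(root / "metrics.json"))
    load_data(data)
    train(data, model)
    evaluate(data, model, metrics)
    return {"dataset": data, "model": model, "metrics": metrics}


with tempfile.TemporaryDirectory() as tmp:
    artifacts = run_pipeline(tmp)

print(artifacts["dataset"].metadata, artifacts["metrics"].metadata)
```

Because each function touches only the artifacts it is handed, a bug in `train` or `evaluate` shows up here in milliseconds instead of after a pod has been scheduled.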
The same component functions, decorated with `@dsl.component` /
`@dsl.pipeline` and submitted to a cluster, give you containerised
steps, retry semantics, parallelism, and full lineage in the MLMD UI.
When Kubeflow is the right answer
The honest list — Kubeflow earns its weight when several of these are true:
- You’re already deep in Kubernetes. The cluster, the IAM, the networking, the storage classes already exist and your team knows them. The marginal cost of one more controller is small.
- You need GPU scheduling, autoscaling node pools, and bin-packing for expensive training jobs — and you’d be reimplementing it on top of another orchestrator anyway.
- You have multiple teams sharing components — a “preprocessing” component used by five pipelines — and you want a metadata store that answers “which models depend on this component version?”
- You need MLMD’s lineage queries for compliance / audit.
When it’s overkill
Equally honest. Reach for something lighter when:
- You have one team and five pipelines. The metadata store payoff is small.
- Your steps are short Python functions, not heavy distributed jobs. A pod-per-step has measurable overhead — 10–30 seconds of cold start per step is real.
- Nobody on your team owns the Kubeflow install. A neglected Kubeflow cluster is worse than no Kubeflow cluster.
- Your scheduler needs are basic: “run this nightly, retry on failure, alert on miss.” Airflow or a managed equivalent (MWAA, Cloud Composer) does that without Kubeflow’s surface area.
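The cold-start point is easy to put a number on. Below is a back-of-the-envelope sketch; the step durations are made-up figures and `overhead_share` is a hypothetical helper, assuming the steps run one after another and each pays the same start-up cost.

```python
def overhead_share(step_seconds: list[float], cold_start_seconds: float) -> float:
    """Fraction of total wall-clock time spent on pod start-up (serial steps)."""
    work = sum(step_seconds)
    overhead = cold_start_seconds * len(step_seconds)
    return overhead / (work + overhead)


# Five short steps of 20 s each, with a 20 s cold start per pod.
light = overhead_share([20] * 5, cold_start_seconds=20)
# The same five pods wrapped around hour-long training-style steps.
heavy = overhead_share([3600] * 5, cold_start_seconds=20)

print(f"short steps: {light:.0%} overhead")  # 50%
print(f"long steps:  {heavy:.1%} overhead")  # 0.6%
```

With short steps the pipeline spends as long starting pods as doing work; with hour-long steps the same start-up cost disappears into the noise.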
How Kubeflow compares to its neighbours
| Tool | What it is | Sweet spot |
|---|---|---|
| Kubeflow Pipelines | K8s-native DAG with artifact tracking | Big K8s shops, multi-team artifact reuse |
| MLflow Projects | A way to package a runnable training job | When tracking is your real need and Projects is a bonus |
| Metaflow | Netflix-born, Python-first DAGs with AWS opinions | Data scientist ergonomics; lighter ops |
| Argo Workflows | General-purpose K8s DAG, not ML-specific | Pair with MLflow for the ML metadata layer |
| Dagster | Asset-based orchestrator (not task-based) | When you think in datasets not tasks |
| Airflow | The general-purpose scheduler everyone has | The pipeline that isn’t the ML training one |
There’s no single right answer. The cost of switching is real, so don’t switch unless you can name the specific capability you’re buying. “Our CTO wants Kubeflow” is not that.
Next
Kubeflow runs on Kubernetes. The just-enough-K8s lesson covers pods, deployments, services, and autoscaling — all the primitives that Kubeflow delegates to under the hood.
Practice this in an interview
Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug, as long as tasks are idempotent.
ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads keep CPU near-idle even under full load. Horizontal Pod Autoscaler in Kubernetes can be configured with custom metrics, and scale-to-zero with a warm-up buffer prevents cold-start latency spikes.
Docker encapsulates the full runtime environment — OS libraries, Python version, system packages — so the model runs identically everywhere. ONNX provides a hardware- and framework-agnostic model format so a model trained in PyTorch can be executed by a high-performance runtime like ONNX Runtime without the training framework as a dependency.
ML CI/CD must validate not just code correctness but also model quality — automated retraining triggers, data validation, model evaluation gates, and canary deployment checks that standard software pipelines have no equivalent for. A regression in model AUC is as much a deployment failure as a 500 error.