Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator that defines DAGs in Python, while Kubeflow Pipelines is ML-native: it passes typed artifacts between containerized steps on Kubernetes and supports conditional logic such as “deploy only if accuracy exceeds a threshold.” The choice depends on whether you need generic scheduling or ML-specific, container-based pipelines.
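To make "dependency management plus retries" concrete, here is a minimal sketch of what an orchestrator does at its core, using only the standard library. The `run_dag` helper, the task names, and the simulated failure are all hypothetical; real orchestrators add scheduling, persistence, and parallelism on top of this idea.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: name -> zero-argument callable
    deps:  name -> set of upstream task names
    Returns a dict of name -> number of attempts used.
    """
    attempts = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                attempts[name] = attempt
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: stop before downstream tasks run
    return attempts

# Hypothetical three-step workflow; "train" fails once, then succeeds.
log = []
state = {"train_calls": 0}

def extract():
    log.append("extract")

def train():
    state["train_calls"] += 1
    if state["train_calls"] == 1:
        raise RuntimeError("simulated GPU node failure")
    log.append("train")

def evaluate():
    log.append("evaluate")

attempts = run_dag(
    tasks={"extract": extract, "train": train, "evaluate": evaluate},
    deps={"extract": set(), "train": {"extract"}, "evaluate": {"train"}},
)
print(log)       # order respects the declared dependencies
print(attempts)  # "train" needed a second attempt; "extract" was not re-run
```

A cron chain can imitate the ordering with staggered start times, but it has no notion of "retry only the step that failed", which is the part this sketch shows.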

How does Apache Airflow work, and what is a DAG backfill?

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.
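The idempotency condition is easier to see in code. The sketch below is plain Python, not Airflow's API: `load_partition`, `backfill`, and the in-memory `warehouse` are hypothetical stand-ins for a task, a backfill command, and a partitioned table.

```python
from datetime import date, timedelta

# Hypothetical in-memory "warehouse", keyed by logical date.
warehouse = {}

def load_partition(logical_date, store):
    """Idempotent task: overwrite the partition for one logical date.

    Writing by key (rather than appending) means a re-run replaces
    the old result instead of duplicating it.
    """
    store[logical_date] = f"rows for {logical_date.isoformat()}"

def backfill(start, end, task, store):
    """Run `task` once per day in [start, end], oldest first."""
    day = start
    runs = 0
    while day <= end:
        task(day, store)  # the task sees the logical date, never "now"
        day += timedelta(days=1)
        runs += 1
    return runs

runs = backfill(date(2024, 1, 1), date(2024, 1, 7), load_partition, warehouse)
# Re-running part of the range (say, after a bug fix) still leaves 7 partitions.
backfill(date(2024, 1, 5), date(2024, 1, 7), load_partition, warehouse)
print(runs, len(warehouse))
```

If `load_partition` appended rows instead of overwriting by date, the second backfill would silently duplicate three days of data.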

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
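As a rough illustration of attribution and unit economics, the sketch below rolls tagged billing rows up by team and computes cost per 1,000 predictions. The rows, tag layout, and helper names are invented for the example; a real setup would read these from the cloud provider's billing export.

```python
from collections import defaultdict

# Hypothetical billing rows: (team, model, environment, cost_usd, predictions)
usage = [
    ("risk", "fraud-v3", "prod", 120.0, 400_000),
    ("risk", "fraud-v3", "dev", 30.0, 0),
    ("growth", "churn-v2", "prod", 45.0, 90_000),
]

def cost_by_tag(rows, tag_index):
    """Total spend grouped by one tag column (0=team, 1=model, 2=env)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[tag_index]] += row[3]
    return dict(totals)

def cost_per_1k_predictions(rows, model):
    """Unit cost for one model, prod only; None if nothing was served."""
    prod = [r for r in rows if r[1] == model and r[2] == "prod"]
    cost = sum(r[3] for r in prod)
    preds = sum(r[4] for r in prod)
    return None if preds == 0 else 1000 * cost / preds

by_team = cost_by_tag(usage, 0)
fraud_unit = cost_per_1k_predictions(usage, "fraud-v3")
churn_unit = cost_per_1k_predictions(usage, "churn-v2")
print(by_team, fraud_unit, churn_unit)
```

In this made-up data the fraud model costs more in total but less per prediction than the churn model, which is the kind of signal total spend alone hides.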

How does autoscaling work for ML inference services, and what metrics should drive it?

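ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads can leave CPU near-idle even under full load. The Kubernetes Horizontal Pod Autoscaler can be driven by custom metrics such as these. Scale-to-zero saves money on idle services, but the first request then pays for a cold start (pod scheduling plus model load), so pair it with a warm-up buffer, such as a minimum replica count during busy hours, to avoid latency spikes.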
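The scaling rule itself is simple proportional control. The sketch below follows the formula given in the Kubernetes Horizontal Pod Autoscaler documentation, desired = ceil(current × metric / target), with a tolerance band and min/max clamps. The function name, default bounds, and the queue-depth numbers are assumptions for illustration.

```python
import math

def desired_replicas(current, metric, target, min_r=1, max_r=20, tolerance=0.1):
    """Proportional scaling on a per-pod metric such as queue depth.

    Within the tolerance band around the target, do nothing, which
    avoids flapping on small fluctuations.
    """
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return max(min_r, min(max_r, math.ceil(current * ratio)))

# Target: 10 queued requests per pod, currently 4 pods.
scale_up = desired_replicas(4, metric=30, target=10)     # queue is 3x target
hold = desired_replicas(4, metric=10.5, target=10)       # inside tolerance
scale_down = desired_replicas(4, metric=2, target=10)    # mostly idle
capped = desired_replicas(10, metric=50, target=10)      # clamp at max_r
print(scale_up, hold, scale_down, capped)
```

Keeping `min_r` at 1 or more is the simplest form of warm-up buffer: the service never has to load a model from scratch to answer the first request.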

Kubeflow Pipelines — ML workflows as Kubernetes-native DAGs — MLOps

The last lesson left us at the edge of the managed garden. A one-click endpoint is wonderful until you need a multi-step training pipeline with your own containers, components shared across teams, and full lineage — at which point you stop consuming the managed platform and start building on the Kubernetes layer it was hiding. We asked how to run real ML pipelines on your own cluster without hand-writing a thousand lines of YAML. The ML-native answer is Kubeflow, and this lesson is an honest look at when it’s worth its weight.

Your fraud-detection retrain is a Makefile that calls four Python scripts in order, scheduled by a Jenkins cron that’s been running since 2022. Last month the cluster’s GPU node went down mid-job and the Makefile didn’t know to retry. Last week a teammate asked “which dataset trained the model currently in prod?” and the answer was “let me grep the Jenkins logs.” The data engineer wants Airflow. The platform team wants Kubeflow. Someone on Slack just said “Dagster.”

This is the lesson where you figure out what Kubeflow actually buys you, and — equally important — when the answer is “less than the brochure suggests.”

The mental model

Kubeflow Pipelines (KFP) v2 is one thing: a way to express an ML workflow as a Kubernetes-native DAG, where every step is its own container, every input and output is a typed artifact tracked by the metadata store, and the whole graph compiles to a YAML file you submit to a cluster.

Compare with Airflow:

Aspect	Airflow	Kubeflow Pipelines
Designed for	General-purpose data orchestration	ML pipelines specifically
Step unit	Python operator (or a `KubernetesPodOperator`)	A containerised component, always
Artifact tracking	You roll your own (XCom is too small)	Built-in — every output is versioned + lineage-tracked
Metadata store	None for ML artifacts (its database tracks DAG and task state)	ML Metadata (MLMD): query “which model came from which dataset”
Reusability	DAG-shaped Python	Components are first-class, sharable across pipelines
K8s required	No	Yes — and an opinionated install at that

The thing Kubeflow is good at is the artifact + metadata story. The thing it costs you is the operational burden of running Kubeflow itself.

Figure: A KFP v2 pipeline as a DAG of containerised steps and typed artifacts.

The KFP v2 DSL

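A KFP component is a Python function decorated with @dsl.component. KFP runs that function inside a container (the base_image, plus any packages_to_install) as a pod, and routes its inputs and outputs through the metadata store. Because only the function itself is shipped to the container, its imports live inside the function body.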

# pipeline.py — a real 3-step training pipeline
from kfp import dsl
from kfp.dsl import Dataset, Model, Metrics, Input, Output

@dsl.component(
    base_image="python:3.12-slim",
    # pyarrow is the Parquet engine that DataFrame.to_parquet needs
    packages_to_install=["scikit-learn==1.6.0", "pandas==2.2.3", "pyarrow==18.1.0"],
)
def load_data(out_dataset: Output[Dataset]):
    """Pull rows, write a Parquet to the artifact path KFP gives us."""
    import pandas as pd
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
    df["label"] = y
    df.to_parquet(out_dataset.path)
    out_dataset.metadata["rows"] = len(df)
    out_dataset.metadata["features"] = 10

@dsl.component(
    base_image="python:3.12-slim",
    # pyarrow is the Parquet engine that pd.read_parquet needs
    packages_to_install=["scikit-learn==1.6.0", "pandas==2.2.3", "pyarrow==18.1.0", "joblib==1.4.2"],
)
def train(
    dataset: Input[Dataset],
    out_model: Output[Model],
    n_estimators: int = 200,
):
    import pandas as pd, joblib
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_parquet(dataset.path)
    X = df.drop(columns=["label"]).values
    y = df["label"].values

    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0).fit(X, y)
    joblib.dump(clf, out_model.path)

    out_model.metadata["framework"] = "scikit-learn"
    out_model.metadata["n_estimators"] = n_estimators

@dsl.component(
    base_image="python:3.12-slim",
    # pyarrow is the Parquet engine that pd.read_parquet needs
    packages_to_install=["scikit-learn==1.6.0", "pandas==2.2.3", "pyarrow==18.1.0", "joblib==1.4.2"],
)
def evaluate(
    dataset: Input[Dataset],
    model: Input[Model],
    out_metrics: Output[Metrics],
):
    import pandas as pd, joblib
    from sklearn.metrics import f1_score, accuracy_score

    df = pd.read_parquet(dataset.path)
    X = df.drop(columns=["label"]).values
    y = df["label"].values

    clf = joblib.load(model.path)
    pred = clf.predict(X)

    out_metrics.log_metric("f1", float(f1_score(y, pred)))
    out_metrics.log_metric("accuracy", float(accuracy_score(y, pred)))

@dsl.pipeline(
    name="churn-training-v2",
    description="Load → train → evaluate, with typed artifacts.",
)
def churn_pipeline(n_estimators: int = 200):
    data = load_data()
    trained = train(dataset=data.outputs["out_dataset"], n_estimators=n_estimators)
    evaluate(
        dataset=data.outputs["out_dataset"],
        model=trained.outputs["out_model"],
    )

Five things to notice:

Each @dsl.component runs as a container. For a lightweight component like these, KFP does not build an image: it runs your function inside the base_image and installs packages_to_install when the step starts (you can also point it at an image you pre-built). Either way the step runs as a Kubernetes pod, with its inputs made available at local paths and its outputs collected afterwards. Running each step in its own container gives you three things: step-level retry once you configure it on the task (a failed training step doesn’t re-run data loading), independent resource requests (training can request a GPU; evaluation uses CPU only), and reproducibility (the image each step ran in is recorded with the run).
Output[Dataset] etc. are typed artifacts. KFP gives the component a .path to write to. You never construct paths yourself: KFP generates the location under the pipeline root, and the URI gets recorded in the metadata store.
out_dataset.metadata["rows"] = ... is searchable later. You can query MLMD: “show me every model trained on a dataset with rows > 10000.”
The pipeline function is pure DSL. It’s not running the code; it’s building a graph. data.outputs["out_dataset"] is a reference, not a value.
Hyperparameters become pipeline parameters. n_estimators is a knob you flip at submit time, not a code edit.
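The "reference, not a value" point is the one that trips people up, so here is a toy model of it. This is not KFP's implementation; `OutputRef`, `Task`, and `step` are hypothetical names that mimic how a graph-building DSL behaves when the pipeline function runs.

```python
class OutputRef:
    """Placeholder for a value that only exists once the step has run."""
    def __init__(self, task_name, output_name):
        self.task_name = task_name
        self.output_name = output_name

class Task:
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs
        self.outputs = {"out": OutputRef(name, "out")}

graph = []  # filled in as the "pipeline function" runs

def step(name, **inputs):
    """Record a step in the graph; nothing is executed here."""
    task = Task(name, inputs)
    graph.append(task)
    return task

def upstream(task):
    """Names of the tasks whose outputs this task consumes."""
    return sorted({v.task_name for v in task.inputs.values() if isinstance(v, OutputRef)})

def toy_pipeline():
    data = step("load_data")
    trained = step("train", dataset=data.outputs["out"])
    step("evaluate", dataset=data.outputs["out"], model=trained.outputs["out"])
    return data

data = toy_pipeline()
ref = data.outputs["out"]
print([t.name for t in graph])              # the graph was built...
print({t.name: upstream(t) for t in graph}) # ...with edges from the references
print(type(ref).__name__, hasattr(ref, "rows"))  # but no data exists yet
```

Calling `toy_pipeline()` produced a list of steps and their edges, and nothing else. The reference carries the name of a future output, so there are no rows to inspect while the graph is being built.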

Compile, then submit

The DSL compiles to a YAML file. That YAML is the artifact your cluster runs. You can check it into git, diff it across versions, and submit it from anywhere with cluster credentials.

# Compile to YAML — this is what your CI commits.
from kfp import compiler
compiler.Compiler().compile(
    pipeline_func=churn_pipeline,
    package_path="churn_pipeline.yaml",
)

# Submit a run (from a machine with cluster access)
from kfp.client import Client
client = Client(host="https://kubeflow.yourcompany.com/pipeline")
client.create_run_from_pipeline_package(
    "churn_pipeline.yaml",
    arguments={"n_estimators": 300},
    experiment_name="churn-experiments",
)

That YAML is also how you schedule recurring runs (KFP has its own recurring-run mechanism), trigger from a CI job, or feed into a metadata-driven UI.

A runnable simulation

Since you can’t run KFP in the browser, here’s the same DAG executed inline — same component boundaries, same artifact handoff, no Kubernetes. It’s the cheapest way to validate your component logic before submitting to a cluster (where a bad component costs you a 30-second pod spin-up to find out).

# A KFP-shaped pipeline executed inline. Same component boundaries,
# same Input/Output contracts, no Kubernetes required.
import tempfile, os
import pandas as pd, joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score

# A tiny shim that imitates KFP's Output[Dataset] / Output[Model] artifacts:
class Artifact:
    def __init__(self, path):
        self.path = path
        self.metadata = {}
    def log_metric(self, k, v):
        self.metadata[k] = v

workdir = tempfile.mkdtemp()

def load_data(out_dataset: Artifact):
    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
    df["label"] = y
    df.to_parquet(out_dataset.path)
    out_dataset.metadata["rows"] = len(df)
    out_dataset.metadata["features"] = 10

def train(dataset: Artifact, out_model: Artifact, n_estimators=200):
    df = pd.read_parquet(dataset.path)
    X = df.drop(columns=["label"]).values
    y = df["label"].values
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0).fit(X, y)
    joblib.dump(clf, out_model.path)
    out_model.metadata["framework"] = "scikit-learn"
    out_model.metadata["n_estimators"] = n_estimators

def evaluate(dataset: Artifact, model: Artifact, out_metrics: Artifact):
    df = pd.read_parquet(dataset.path)
    X = df.drop(columns=["label"]).values
    y = df["label"].values
    clf = joblib.load(model.path)
    pred = clf.predict(X)
    out_metrics.log_metric("f1", float(f1_score(y, pred)))
    out_metrics.log_metric("accuracy", float(accuracy_score(y, pred)))

# Wire the DAG the way @dsl.pipeline would
ds      = Artifact(os.path.join(workdir, "data.parquet"))
mdl     = Artifact(os.path.join(workdir, "model.joblib"))
metrics = Artifact(os.path.join(workdir, "metrics.json"))

load_data(ds)
train(ds, mdl, n_estimators=200)
evaluate(ds, mdl, metrics)

print("dataset metadata:", ds.metadata)
print("model metadata:  ", mdl.metadata)
print("metrics:         ", metrics.metadata)

dataset metadata: {'rows': 5000, 'features': 10}
model metadata:   {'framework': 'scikit-learn', 'n_estimators': 200}
metrics:          {'f1': 1.0, 'accuracy': 1.0}

Watch the metadata flow, because that’s the entire point, not the scores. Each step writes facts about its output (rows: 5000, n_estimators: 200, the metrics) onto the artifact it hands downstream, and in a real KFP run those facts land in the metadata store where “which dataset trained this model?” becomes a query instead of a Jenkins-log grep. (The f1: 1.0 is honest but not impressive: this toy evaluates the model on the same data it trained on, so a Random Forest memorises it; a real pipeline would hold out a test set. The plumbing is the lesson, not the perfect score.) To run this on a cluster, move the imports back inside each function, swap the Artifact shim for Input[...] / Output[...] annotations, decorate with @dsl.component / @dsl.pipeline, and submit. You then get containerised steps, configurable step-level retries, parallelism across independent steps, and that lineage.
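To show what "lineage becomes a query" means mechanically, the sketch below records which step consumed and produced which artifact, then walks that record backwards. It is a toy in-memory store, not the MLMD API; `LineageStore`, `ArtifactRecord`, and the `mem://` URIs are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ArtifactRecord:
    uri: str
    kind: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Execution:
    step: str
    inputs: list   # URIs consumed
    outputs: list  # URIs produced

class LineageStore:
    def __init__(self):
        self.artifacts = {}
        self.executions = []

    def record(self, step, inputs, outputs):
        """Register one step run together with the artifacts it touched."""
        for art in inputs + outputs:
            self.artifacts[art.uri] = art
        self.executions.append(
            Execution(step, [a.uri for a in inputs], [a.uri for a in outputs])
        )

    def upstream(self, uri, kind):
        """Walk producing executions backwards; return ancestor URIs of `kind`."""
        found, stack, seen = [], [uri], set()
        while stack:
            current = stack.pop()
            for ex in self.executions:
                if current not in ex.outputs:
                    continue
                for parent in ex.inputs:
                    if parent not in seen:
                        seen.add(parent)
                        stack.append(parent)
                        if self.artifacts[parent].kind == kind:
                            found.append(parent)
        return found

store = LineageStore()
ds = ArtifactRecord("mem://run-1/data.parquet", "Dataset", {"rows": 5000})
mdl = ArtifactRecord("mem://run-1/model.joblib", "Model", {"n_estimators": 200})
met = ArtifactRecord("mem://run-1/metrics.json", "Metrics")

store.record("load_data", [], [ds])
store.record("train", [ds], [mdl])
store.record("evaluate", [ds, mdl], [met])

trained_on = store.upstream(mdl.uri, "Dataset")
print(trained_on, store.artifacts[trained_on[0]].metadata)
```

The query works only because every step declared its inputs and outputs as artifacts. The Makefile in the opening scenario never recorded that information, which is why the answer was a log grep.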

When Kubeflow is the right answer

The honest list — Kubeflow earns its weight when several of these are true:

You’re already deep in Kubernetes. The cluster, the IAM, the networking, the storage classes already exist and your team knows them. The marginal cost of one more controller is small.
You need GPU scheduling, autoscaling node pools, and bin-packing for expensive training jobs — and you’d be reimplementing it on top of another orchestrator anyway.
You have multiple teams sharing components — a “preprocessing” component used by five pipelines — and you want a metadata store that answers “which models depend on this component version?”
You need MLMD’s lineage queries for compliance / audit.

When it’s overkill

Equally honest. Reach for something lighter when:

You have one team and five pipelines. The metadata store payoff is small.
Your steps are short Python functions, not heavy distributed jobs. A pod-per-step has measurable overhead — 10–30 seconds of cold start per step is real.
Nobody on your team owns the Kubeflow install. A neglected Kubeflow cluster is worse than no Kubeflow cluster.
Your scheduler needs are basic: “run this nightly, retry on failure, alert on miss.” Airflow or a managed equivalent (MWAA, Cloud Composer) does that without Kubeflow’s surface area.
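A quick back-of-the-envelope calculation shows when the pod-per-step overhead matters. The step durations below are invented, and the 20-second start-up figure is simply the midpoint of the 10–30 second range mentioned above.

```python
def overhead_fraction(step_seconds, startup_seconds):
    """Share of wall-clock time spent starting pods, for steps run in sequence."""
    startup = startup_seconds * len(step_seconds)
    return startup / (startup + sum(step_seconds))

# Hypothetical timings: four 5-second Python steps vs. a pipeline
# dominated by a 2-hour training job.
light = overhead_fraction([5, 5, 5, 5], startup_seconds=20)
heavy = overhead_fraction([60, 7200, 120], startup_seconds=20)
print(f"{light:.0%} {heavy:.1%}")
```

With these numbers the light pipeline spends most of its time waiting for pods, while the heavy one barely notices. That asymmetry is the practical meaning of "short Python functions, not heavy distributed jobs."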

How Kubeflow compares to its neighbours

Tool	What it is	Sweet spot
Kubeflow Pipelines	K8s-native DAG with artifact tracking	Big K8s shops, multi-team artifact reuse
MLflow Projects	A way to package a runnable training job	When tracking is your real need, projects is a bonus
Metaflow	Netflix-born, Python-first DAGs with AWS opinions	Data scientist ergonomics; lighter ops
Argo Workflows	General-purpose K8s DAG, not ML-specific	Pair with MLflow for the ML metadata layer
Dagster	Asset-based orchestrator (not task-based)	When you think in datasets not tasks
Airflow	The general-purpose scheduler everyone has	The pipeline that isn’t the ML training one

There’s no single right answer. The cost of switching is real, so don’t switch unless you can name the specific capability you’re buying. “Our CTO wants Kubeflow” is not that.

In one breath

Kubeflow Pipelines (KFP v2) expresses an ML workflow as a Kubernetes-native DAG where every step is a containerised @dsl.component (its own pod, with step-level retries and independent resource requests) and every input/output is a typed artifact — Dataset, Model, Metrics — tracked in a metadata store so lineage (“which dataset trained this model?”) is a query; the whole graph compiles to a YAML you submit to a cluster — and its real payoff is that artifact-and-metadata story, paid for with the operational burden of running Kubeflow itself, which is why for one team with five pipelines, MLflow plus a lighter orchestrator usually wins.

Practice

Before the quiz, sit with the DSL trap the lesson warns about. You write if data.outputs["out_dataset"].rows > 10000: inside the @dsl.pipeline function to branch on dataset size — and it doesn’t work. Explain why: what is that pipeline function actually doing when it “calls” load_data(), and where must real conditionals live instead? Then the judgment call: name two specific conditions under which Kubeflow earns its weight, and two under which “our CTO wants Kubeflow” is the only reason — and what you’d reach for instead.

Quick check

Q1. What does the `Output[Dataset]` type in a KFP v2 component actually give you that a plain `str` path wouldn't?

Q2. You have one ML team, five pipelines, and no existing Kubernetes expertise. Should you adopt Kubeflow?

Q3. Why does the pipeline function in KFP look like it's calling each component but isn't actually running them?

A question to carry forward

Notice how much of this lesson leaned on words we never actually defined. “Each component is a pod.” “KFP runs it on a node.” “The cluster’s scheduler places it.” We used pod, node, scheduler, container-per-step as if they were obvious — and Kubeflow’s whole pitch is that it hides them so you can think in DAGs and artifacts instead.

But hidden is not gone. The day a KFP step sits stuck in Pending forever, or a GPU job won’t schedule, or a pod gets killed and restarted mid-run, the abstraction evaporates and you are debugging raw Kubernetes whether you learned it or not. So the question to carry forward is the layer underneath everything in this lesson: what is a pod, a deployment, a service, a node, the scheduler — the handful of Kubernetes primitives that Kubeflow (and every managed ML platform) delegates to under the hood, and that you need just enough of to debug when the magic stops? That is just enough Kubernetes for an ML engineer, and it is the next lesson.
