What does it mean for a pipeline task to be idempotent, and why does it matter for backfills and retries?

An idempotent task produces the same result whether it runs once or many times, typically by writing to a deterministic partition and overwriting rather than appending. This matters because orchestrators retry failed tasks and run backfills over historical dates, and non-idempotent tasks would double-count or corrupt data on re-runs. Designing tasks to be idempotent and partitioned by execution date makes retries and backfills safe and reproducible.
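
As a minimal sketch of the overwrite-by-partition idea, the snippet below uses only the Python standard library. The function names, the `date=YYYY-MM-DD` directory layout, and the sample rows are assumptions made for illustration; they do not come from any particular orchestrator.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def write_partition(root: Path, logical_date: date, rows: list[dict]) -> Path:
    """Overwrite the partition for logical_date; never append to it."""
    # The path depends only on the logical date, so a re-run hits the same file.
    partition = root / f"date={logical_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "data.json"
    # Write to a temporary file first, then rename it over the target, so a
    # crash mid-write is unlikely to leave a half-written partition behind.
    tmp = partition / "data.json.tmp"
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)
    return target


def count_rows(root: Path) -> int:
    """Total rows across all partitions under root."""
    return sum(len(json.loads(p.read_text())) for p in root.glob("date=*/data.json"))


root = Path(tempfile.mkdtemp())
rows = [{"user": 1, "clicks": 3}, {"user": 2, "clicks": 5}]
for _ in range(3):  # simulate the first run plus two retries
    write_partition(root, date(2024, 1, 1), rows)
total_rows = count_rows(root)
print(total_rows)  # 2, not 6: retries replaced the partition instead of appending
```

An appending version of the same task would report six rows after three runs, which is the double-counting the answer above warns about.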

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator that defines DAGs in Python, while Kubeflow Pipelines is ML-native: it passes typed artifacts between containerized steps on Kubernetes and supports conditional logic, such as deploying only if accuracy exceeds a threshold. The choice depends on whether you need generic scheduling or ML-specific, container-based pipelines.

How does Apache Airflow work, and what is a DAG backfill?

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.
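
To make the logical-date idea concrete, here is a plain-Python sketch of what a daily backfill amounts to. It deliberately does not use Airflow's API; `run_task`, `backfill`, the in-memory `store`, and the placeholder computation are all hypothetical stand-ins.

```python
from datetime import date, timedelta


def daily_logical_dates(start: date, end: date) -> list[date]:
    """Every logical date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def run_task(store: dict[str, int], logical_date: date) -> None:
    """Hypothetical task: compute a value for one date and overwrite its slot."""
    # The output is keyed on the logical date the run was given, not on
    # date.today(), so running it weeks later still fills the right period.
    store[logical_date.isoformat()] = logical_date.day * 10  # placeholder computation


def backfill(store: dict[str, int], start: date, end: date) -> None:
    """Run the task once per logical date in the range."""
    for logical_date in daily_logical_dates(start, end):
        run_task(store, logical_date)


store: dict[str, int] = {}
backfill(store, date(2024, 3, 1), date(2024, 3, 5))
first_pass = dict(store)
# Re-running an overlapping range (say, after a bug fix) leaves the result unchanged.
backfill(store, date(2024, 3, 3), date(2024, 3, 5))
print(store == first_pass)
```

Because each run overwrites the slot for its own logical date, the overlapping second backfill is harmless. If the task keyed its output on wall-clock time instead, every backfilled run would write to today's slot.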

How do you safely promote a model to production using a model registry?

Register every candidate as an immutable, versioned artifact, then move it through environments (dev to staging to prod) gated by automated checks rather than promoting straight to prod. In modern MLflow you use aliases like champion and challenger instead of the deprecated stage labels, and promotion is a governed, auditable action with sign-off and an easy rollback by repointing the alias. Always validate in staging and roll out progressively (canary or shadow) before full traffic.
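
The alias mechanics can be sketched with a toy in-memory registry. This is not MLflow's API: `ToyRegistry`, `promote_if_better`, and the accuracy numbers are hypothetical, and a real gate would usually check more than one metric.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelVersion:
    """An immutable registered version."""
    version: int
    accuracy: float  # metric recorded at registration time


@dataclass
class ToyRegistry:
    versions: dict[int, ModelVersion] = field(default_factory=dict)
    aliases: dict[str, int] = field(default_factory=dict)
    history: list[tuple[str, int]] = field(default_factory=list)  # audit trail

    def register(self, accuracy: float) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, accuracy=accuracy)
        self.versions[mv.version] = mv
        return mv

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version
        self.history.append((alias, version))

    def promote_if_better(self, candidate: int) -> bool:
        """Repoint 'champion' only if the candidate beats the current champion."""
        current = self.aliases.get("champion")
        if current is not None:
            if self.versions[candidate].accuracy <= self.versions[current].accuracy:
                return False
        self.set_alias("champion", candidate)
        return True


registry = ToyRegistry()
v1 = registry.register(accuracy=0.90)
registry.set_alias("champion", v1.version)
v2 = registry.register(accuracy=0.93)
v3 = registry.register(accuracy=0.91)

promoted_v2 = registry.promote_if_better(v2.version)  # passes the gate
promoted_v3 = registry.promote_if_better(v3.version)  # blocked: worse than v2
champion_after_gate = registry.aliases["champion"]

registry.set_alias("champion", v1.version)  # rollback is just repointing the alias
champion_after_rollback = registry.aliases["champion"]
```

Serving code that resolves the `champion` alias at load time never needs to change when a promotion or rollback happens; only the pointer moves. In MLflow itself, the corresponding operation is exposed through the client's registered-model alias methods, which this sketch does not call.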

Pipeline orchestration — MLOps

The last lesson left our Model CD pipeline straining against GitHub Actions — a branching, data-dependent sequence of steps forced into a flat list of jobs glued with needs:. We saw it crack the moment we wanted real pipeline behavior: fan-out over partitions, retry just the failed step, backfill last month, see which stage died. We asked what engine runs a DAG like that when a trigger no longer suffices. This lesson answers it.

Your retraining job is five scripts run by a cron job that has no idea what to do when step 3 fails halfway. An orchestrator is the system that fixes this: it runs your pipeline as a DAG (a directed acyclic graph of steps), handles scheduling, retries failed steps, passes data between them, and gives you a UI to see what ran, what failed, and why. (One of the four below, Kubeflow, gets its own lesson later in this section; here we map the whole landscape and how to choose.)

What an orchestrator gives you

Compared with cron plus scripts, an orchestrator adds:

Dependencies — “train only after data-prep succeeds” expressed as a DAG.
Retries & recovery — automatic retry with backoff; resume from the failed step.
Scheduling & triggers — time-based, or event-based (new data arrives).
Observability — a UI with run history, logs, and lineage; alerts on failure.
Backfills — re-run the pipeline over a past date range.
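
The first two items, dependency ordering and retry with backoff, can be sketched in a few lines of standard-library Python. This is a toy runner written for illustration, not how any of the orchestrators below is implemented; `run_dag`, the task names, and the delay values are assumptions.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable


def run_dag(
    tasks: dict[str, Callable[[], None]],
    upstream: dict[str, set[str]],
    max_retries: int = 3,
    base_delay: float = 0.01,
) -> list[str]:
    """Run tasks in dependency order, retrying each with exponential backoff."""
    completed: list[str] = []
    # static_order() yields every task after all of the tasks it depends on.
    for name in TopologicalSorter(upstream).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; downstream tasks never start
                time.sleep(base_delay * 2 ** attempt)
        completed.append(name)
    return completed


attempts = {"train": 0}


def flaky_train() -> None:
    """Fails twice with a transient error, then succeeds."""
    attempts["train"] += 1
    if attempts["train"] < 3:
        raise RuntimeError("transient failure")


tasks = {
    "data_prep": lambda: None,
    "train": flaky_train,
    "evaluate": lambda: None,
}
# Maps each task to the set of tasks it depends on.
upstream = {"data_prep": set(), "train": {"data_prep"}, "evaluate": {"train"}}
order = run_dag(tasks, upstream)
print(order)  # ['data_prep', 'train', 'evaluate']
```

Only `train` is retried here; `data_prep` is not re-run when `train` fails, which is the behavior a bare cron entry running one long script cannot give you. Real orchestrators add persistence on top, so the run state survives a process restart.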

The four you’ll meet

Four orchestrators, four philosophies — from Airflow’s mature task DAGs to Kubeflow’s Kubernetes-native ML pipelines.

Airflow — the incumbent and still the 2026 default. Massive ecosystem of connectors, battle-tested at scale. Task-centric (you orchestrate operations), and heavier to self-host.
Dagster — asset-aware: you declare the data assets you want and Dagster works out the tasks. Strong typing, testability, and a great fit when ML sits alongside dbt/analytics.
Prefect — low-ops and pythonic, with first-class dynamic workflows (loops, branches decided at runtime) and a hybrid model. The quickest to get running.
Kubeflow — Kubernetes-native ML specifically: every step is a container, with built-in artifact/metadata lineage. Powerful, but only worth it if you’re already on K8s.

In one breath

An orchestrator is the engine that runs your ML pipeline as a DAG — adding, over a bare cron-plus-scripts, dependency ordering, automatic retries and resume-from-failure, time- and event-based scheduling, an observability UI with lineage, and backfills over past date ranges; the four you’ll meet differ by philosophy — Airflow the mature task-centric default, Dagster the data-asset-aware one, Prefect the low-ops pythonic one, Kubeflow the Kubernetes-native ML one — and you pick by your stack, not by hype.

Practice

Before the quiz, reason about the decision rule with a concrete team. You’re a four-person startup, not on Kubernetes, that wants a retrain DAG running by this afternoon — which orchestrator, and which would be exactly the wrong choice and why? Then the boundary question: the lesson insists an orchestrator is not an experiment tracker or a feature store. If a teammate proposes making Airflow also store metrics and serve features “to keep it all in one place,” what specifically goes wrong?

Quick check

Q1. What does a pipeline orchestrator provide over a cron job running scripts?

Q2. What distinguishes Dagster's approach from Airflow's?

Q3. When is Kubeflow the right orchestrator choice?

A question to carry forward

That closes the Tooling chapter. Stand back and look at the machine we’ve built across it: experiment tracking, data versioning, a model registry, a test suite, a container, CI/CD, and now an orchestrator to run the whole DAG on a schedule. Press the button and out comes a trained, versioned, tested, blessed model sitting in the registry.

And there it sits. Because everything in this chapter was about producing a good model — and producing one is not the same as serving it. A model in a registry answers no requests; it earns nothing until some process loads it and replies to “what’s the prediction for this user?” over the network, in milliseconds, under real traffic. So the question to carry forward, out of tooling and into the next chapter, is the one the whole pipeline was building toward: how do you put a trained model behind an API and serve it to the world? That is serving with FastAPI, and it opens the Serving & Monitoring chapter.
