Pipeline orchestration
Airflow, Dagster, Prefect, or Kubeflow? The job that schedules, retries, and connects your ML pipeline steps — and how to choose the right orchestrator for your team.
What you'll learn
- What an orchestrator does — DAGs, scheduling, retries, observability
- Airflow vs Dagster vs Prefect vs Kubeflow — the real tradeoffs
- How to pick one for your team and stack
Before you start
Your retraining job is five scripts run by a cron that has no idea what to do when step 3 fails halfway. An orchestrator is the system that fixes this: it runs your pipeline as a DAG (directed acyclic graph of steps), handles scheduling, retries failed steps, passes data between them, and gives you a UI to see what ran, what failed, and why. You’ve already met Kubeflow; this lesson is the broader landscape and how to choose.
What an orchestrator gives you
Compared with cron plus a folder of scripts, an orchestrator adds:
- Dependencies — “train only after data-prep succeeds” expressed as a DAG.
- Retries & recovery — automatic retry with backoff; resume from the failed step.
- Scheduling & triggers — time-based, or event-based (new data arrives).
- Observability — a UI with run history, logs, and lineage; alerts on failure.
- Backfills — re-run the pipeline over a past date range.
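To make the dependency and retry ideas above concrete, here is a minimal sketch in plain Python. It is not any orchestrator's real API: `run_pipeline`, the step names, and the `flaky_train` function are hypothetical, and a real orchestrator would also persist state, run steps in parallel, and resume from the failed step instead of starting over.

```python
import time
from graphlib import TopologicalSorter


def run_pipeline(steps, deps, max_retries=2, base_delay=0.0):
    """Run steps in dependency order, retrying failures with exponential backoff.

    steps: dict of step name -> zero-argument callable (hypothetical step functions)
    deps:  dict of step name -> set of step names that must succeed first
    Returns a dict of step name -> number of attempts it took.
    """
    attempts = {}
    # static_order() yields each step only after all of its upstream steps.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                steps[name]()
                attempts[name] = attempt
                break
            except Exception:
                if attempt > max_retries:
                    raise  # retries exhausted: fail the run, downstream steps never start
                time.sleep(base_delay * 2 ** (attempt - 1))
    return attempts


calls = {"train": 0}


def flaky_train():
    # Hypothetical step that fails once, then succeeds on the retry.
    calls["train"] += 1
    if calls["train"] < 2:
        raise RuntimeError("transient failure")


steps = {"data_prep": lambda: None, "train": flaky_train, "evaluate": lambda: None}
deps = {"data_prep": set(), "train": {"data_prep"}, "evaluate": {"train"}}
result = run_pipeline(steps, deps)
```

Here `train` runs only after `data_prep` succeeds and is retried once before `evaluate` starts, which is the behavior a bare cron schedule cannot express.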
The four you’ll meet
- Airflow — the incumbent and still the 2026 default. Massive ecosystem of connectors, battle-tested at scale. Task-centric (you orchestrate operations), and heavier to self-host.
- Dagster — asset-aware: you declare the data assets you want and Dagster works out the tasks. Strong typing, testability, and a great fit when ML sits alongside dbt/analytics.
- Prefect — low-ops and pythonic, with first-class dynamic workflows (loops, branches decided at runtime) and a hybrid model. The quickest to get running.
- Kubeflow — Kubernetes-native ML specifically: every step is a container, with built-in artifact/metadata lineage. Powerful, but only worth it if you’re already on K8s.
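The asset-aware idea in the list above can be illustrated with a toy sketch: you declare the data you want, and the execution order is derived from what each asset needs. This is plain Python written for illustration, not Dagster's API. The `asset` decorator, `materialize` function, and the three asset names are all hypothetical.

```python
import inspect

ASSETS = {}  # asset name -> function that produces it


def asset(fn):
    """Register a function as the producer of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_events():
    return [3, 1, 2]


@asset
def features(raw_events):
    # The parameter name declares the upstream asset this one depends on.
    return sorted(raw_events)


@asset
def model(features):
    return {"max_feature": features[-1]}


def materialize(name, cache=None):
    """Build the requested asset, first building whatever it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = [materialize(p, cache) for p in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]


result = materialize("model")
```

In a task-centric tool you would instead write the three steps and wire their order by hand; here asking for `model` is enough to work out that `raw_events` and `features` must be built first.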
Next
Orchestration runs the retraining loop that drift triggers — and schedules the tests that gate each run.
Practice this in an interview
ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic such as "deploy only if accuracy exceeds a threshold". Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.
Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.
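The link between logical dates, backfills, and idempotency can be sketched without Airflow itself. The following is a plain-Python illustration under stated assumptions: `backfill`, `load_partition`, and `extract` are hypothetical names, the in-memory `store` stands in for a real partitioned table, and one run per day is assumed.

```python
from datetime import date, timedelta


def logical_dates(start, end):
    """Yield one logical date per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def load_partition(store, logical_date, rows):
    # Idempotent write: overwrite the partition keyed by the logical date
    # instead of appending, so a rerun replaces rows rather than duplicating them.
    store[logical_date.isoformat()] = list(rows)


def extract(logical_date):
    # Hypothetical source query; it depends only on the logical date,
    # never on the wall-clock time at which the run happens.
    return [f"event-{logical_date.isoformat()}"]


def backfill(store, start, end):
    """Re-run the daily load for every logical date in the range."""
    for day in logical_dates(start, end):
        load_partition(store, day, extract(day))


store = {}
backfill(store, date(2024, 1, 1), date(2024, 1, 3))
snapshot = dict(store)
backfill(store, date(2024, 1, 1), date(2024, 1, 3))  # rerun the same range
```

Because each run is keyed by its logical date and overwrites its own partition, running the backfill twice leaves the store unchanged. An append-based load would have doubled every row.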
Register every candidate as an immutable, versioned artifact, then move it through environments (dev to staging to prod) gated by automated checks rather than promoting straight to prod. In modern MLflow you use aliases like champion and challenger instead of the deprecated stage labels, and promotion is a governed, auditable action with sign-off and an easy rollback by repointing the alias. Always validate in staging and roll out progressively (canary or shadow) before full traffic.
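As a rough sketch of alias-based promotion, the toy registry below gates promotion on a metric check, records who approved it, and rolls back by repointing the alias. It is an in-memory illustration, not MLflow's API: the `Registry` class, its methods, the accuracy threshold, and the approver names are all hypothetical.

```python
class Registry:
    """Toy model registry: numbered versions plus movable aliases."""

    def __init__(self):
        self.versions = {}   # version number -> metrics recorded at registration
        self.aliases = {}    # alias (e.g. "champion") -> version number
        self.audit_log = []  # (alias, previous version, new version, approved_by)

    def register(self, metrics):
        version = len(self.versions) + 1
        self.versions[version] = dict(metrics)  # copy so later edits don't leak in
        return version

    def promote(self, version, alias, min_accuracy, approved_by):
        # Automated gate: refuse to move the alias if the check fails.
        if self.versions[version]["accuracy"] < min_accuracy:
            raise ValueError(f"version {version} failed the accuracy gate")
        previous = self.aliases.get(alias)
        self.aliases[alias] = version
        self.audit_log.append((alias, previous, version, approved_by))

    def rollback(self, alias, approved_by):
        # Repoint the alias to the version it referenced before the last promotion.
        for logged_alias, previous, current, _ in reversed(self.audit_log):
            if logged_alias == alias and previous is not None:
                self.aliases[alias] = previous
                self.audit_log.append((alias, current, previous, approved_by))
                return previous
        raise LookupError(f"no earlier version recorded for alias {alias!r}")


registry = Registry()
v1 = registry.register({"accuracy": 0.90})
registry.promote(v1, "champion", min_accuracy=0.85, approved_by="alice")
v2 = registry.register({"accuracy": 0.93})
registry.promote(v2, "champion", min_accuracy=0.85, approved_by="bob")
restored = registry.rollback("champion", approved_by="alice")
```

The model artifacts never move; only the alias does, which is what makes rollback a single cheap, auditable step.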
Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
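A unit-economics metric such as cost per prediction falls out directly once workloads are tagged. The sketch below is illustrative only: the record fields, team and model names, and dollar figures are made-up sample data, and `cost_per_thousand_predictions` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical billing records, each tagged by team, model, and environment.
records = [
    {"team": "risk", "model": "fraud", "env": "prod", "cost_usd": 120.0, "predictions": 400_000},
    {"team": "risk", "model": "fraud", "env": "prod", "cost_usd": 80.0, "predictions": 600_000},
    {"team": "growth", "model": "churn", "env": "staging", "cost_usd": 30.0, "predictions": 0},
]


def cost_per_thousand_predictions(records):
    """Aggregate by tag and return (unit costs, idle workloads with spend but no traffic)."""
    cost = defaultdict(float)
    served = defaultdict(int)
    for r in records:
        key = (r["team"], r["model"], r["env"])
        cost[key] += r["cost_usd"]
        served[key] += r["predictions"]
    unit_cost = {k: 1000 * cost[k] / served[k] for k in cost if served[k] > 0}
    idle = {k: cost[k] for k in cost if served[k] == 0}
    return unit_cost, idle


unit_cost, idle = cost_per_thousand_predictions(records)
```

With this sample data the fraud model in prod costs about 0.20 USD per thousand predictions, while the churn endpoint in staging shows up as idle spend, which is the kind of signal a total-spend dashboard hides.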