datarekha

Pipeline orchestration

Airflow, Dagster, Prefect, or Kubeflow? The job that schedules, retries, and connects your ML pipeline steps — and how to choose the right orchestrator for your team.

7 min read Intermediate MLOps Lesson 11 of 28

What you'll learn

  • What an orchestrator does — DAGs, scheduling, retries, observability
  • Airflow vs Dagster vs Prefect vs Kubeflow — the real tradeoffs
  • How to pick one for your team and stack

Before you start

Your retraining job is five scripts run by a cron job that has no idea what to do when step 3 fails halfway. An orchestrator is the system that fixes this: it runs your pipeline as a DAG (a directed acyclic graph of steps), handles scheduling, retries failed steps, passes data between them, and gives you a UI to see what ran, what failed, and why. You’ve already met Kubeflow; this lesson covers the broader landscape and how to choose.

What an orchestrator gives you

Compared with cron plus scripts, an orchestrator adds:

  • Dependencies — “train only after data-prep succeeds” expressed as a DAG.
  • Retries & recovery — automatic retry with backoff; resume from the failed step.
  • Scheduling & triggers — time-based, or event-based (new data arrives).
  • Observability — a UI with run history, logs, and lineage; alerts on failure.
  • Backfills — re-run the pipeline over a past date range.
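The first two items can be sketched in a few lines of standard-library Python. This is a toy runner written for illustration, not how any of the orchestrators below is implemented; `run_pipeline`, the step names, and the retry settings are all hypothetical.

```python
import time
from graphlib import TopologicalSorter


def run_pipeline(steps, deps, max_retries=2, base_delay=0.01):
    """Run steps in dependency order, retrying each failed step with backoff.

    steps: dict mapping step name -> zero-argument callable
    deps:  dict mapping step name -> set of step names it depends on
    Returns a dict of step name -> number of attempts it took.
    """
    attempts = {}
    # static_order() yields each step only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                steps[name]()
                attempts[name] = attempt
                break
            except Exception:
                if attempt > max_retries:
                    raise  # retries exhausted: abort, downstream steps never run
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return attempts


# Hypothetical three-step retraining pipeline; "train" fails once, then succeeds.
order = []
train_calls = {"count": 0}


def data_prep():
    order.append("data_prep")


def train():
    train_calls["count"] += 1
    if train_calls["count"] == 1:
        raise RuntimeError("transient failure")
    order.append("train")


def evaluate():
    order.append("evaluate")


steps = {"data_prep": data_prep, "train": train, "evaluate": evaluate}
deps = {"data_prep": set(), "train": {"data_prep"}, "evaluate": {"train"}}
attempts = run_pipeline(steps, deps)
```

Here `train` is retried in place rather than restarting the whole pipeline, and `evaluate` only runs once `train` has succeeded. A real orchestrator also persists this state, so a crashed run can resume from the failed step.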

The four you’ll meet

Figure: Four orchestrators, four philosophies — from Airflow’s mature task DAGs to Kubeflow’s Kubernetes-native ML pipelines.

  • Airflow — the incumbent and still the 2026 default. Massive ecosystem of connectors, battle-tested at scale. Task-centric (you orchestrate operations), and heavier to self-host.
  • Dagster — asset-aware: you declare the data assets you want and Dagster works out the tasks. Strong typing, testability, and a great fit when ML sits alongside dbt/analytics.
  • Prefect — low-ops and pythonic, with first-class dynamic workflows (loops, branches decided at runtime) and a hybrid model. The quickest to get running.
  • Kubeflow — Kubernetes-native, built specifically for ML: every step is a container, with built-in artifact/metadata lineage. Powerful, but only worth it if you’re already on K8s.
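To make the task-centric versus asset-aware distinction concrete, here is a minimal sketch of the asset idea in plain Python. It is a toy imitation, not Dagster's API: the `asset` decorator, `ASSETS` registry, and `materialize` function are hypothetical names, and the "model" is just an average. The point is that you only describe each piece of data and what it is built from; the execution order falls out of those declarations.

```python
import inspect

ASSETS = {}  # asset name -> function that computes it


def asset(fn):
    """Register a function as the recipe for the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_rows():
    return [1, 2, 3, 4]


@asset
def features(raw_rows):
    return [x * 2 for x in raw_rows]


@asset
def model(features):
    # Stand-in for training: the "model" is just the mean of the features.
    return sum(features) / len(features)


def materialize(target, cache=None):
    """Build `target`, first building whatever it depends on."""
    cache = {} if cache is None else cache
    if target not in cache:
        fn = ASSETS[target]
        # Upstream assets are read off the function's parameter names.
        upstream = {
            name: materialize(name, cache)
            for name in inspect.signature(fn).parameters
        }
        cache[target] = fn(**upstream)
    return cache[target]


result = materialize("model")
```

Nothing in this code says "run `raw_rows`, then `features`, then `model`". In a task-centric tool you would write those ordering edges yourself.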

Quick check

Q1. What does a pipeline orchestrator provide over a cron job running scripts?
Q2. What distinguishes Dagster's approach from Airflow's?
Q3. When is Kubeflow the right orchestrator choice?

Next

Orchestration runs the retraining loop that drift triggers — and schedules the tests that gate each run.

Practice this in an interview

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic like deploy only if accuracy exceeds a threshold. Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.

How does Apache Airflow work, and what is a DAG backfill?

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.
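The relationship between logical dates, backfills, and idempotency can be shown without Airflow at all. The sketch below uses plain Python; `warehouse`, `load_partition`, and `backfill` are hypothetical names, and the dictionary stands in for a date-partitioned table.

```python
from datetime import date, timedelta

warehouse = {}  # stand-in for a partitioned table, keyed by logical date


def load_partition(logical_date):
    """Idempotent task: overwrite the partition for one logical date."""
    # Keyed on the logical date, never on date.today(), so a re-run
    # replaces the same partition instead of appending a duplicate.
    warehouse[logical_date] = f"rows for {logical_date.isoformat()}"


def backfill(task, start, end):
    """Run `task` once per day for every logical date in [start, end]."""
    ran = []
    current = start
    while current <= end:
        task(current)
        ran.append(current)
        current += timedelta(days=1)
    return ran


ran = backfill(load_partition, date(2024, 1, 1), date(2024, 1, 3))
# Re-running part of the range is safe because the task overwrites by key.
backfill(load_partition, date(2024, 1, 2), date(2024, 1, 3))
```

If `load_partition` appended rows or stamped them with the current time, the second backfill would duplicate or corrupt the data, which is why idempotent tasks are the precondition for backfilling.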

How do you safely promote a model to production using a model registry?

Register every candidate as an immutable, versioned artifact, then move it through environments (dev to staging to prod) gated by automated checks rather than promoting straight to prod. In modern MLflow you use aliases like champion and challenger instead of the deprecated stage labels, and promotion is a governed, auditable action with sign-off and an easy rollback by repointing the alias. Always validate in staging and roll out progressively (canary or shadow) before full traffic.
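The alias mechanism is easier to see in a small in-memory model. The sketch below is not MLflow's client API; `ToyRegistry`, its methods, and the accuracy gate are hypothetical and only illustrate gated promotion, an audit trail, and rollback by repointing an alias.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry with versions and aliases."""

    def __init__(self):
        self.versions = []   # registered artifacts; index + 1 is the version
        self.aliases = {}    # alias name -> version number
        self.audit_log = []  # (alias, previous_version, new_version)

    def register(self, artifact):
        """Store a new artifact and return its version number."""
        self.versions.append(dict(artifact))  # copy, so later edits don't leak in
        return len(self.versions)

    def get(self, version):
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        return self.versions[version - 1]

    def set_alias(self, alias, version):
        """Point an alias at a version and record the change."""
        self.get(version)  # validates that the version exists
        self.audit_log.append((alias, self.aliases.get(alias), version))
        self.aliases[alias] = version

    def promote(self, version, checks):
        """Point 'champion' at `version` only if every check passes."""
        artifact = self.get(version)
        if not all(check(artifact) for check in checks):
            return False
        self.set_alias("champion", version)
        return True


def accurate_enough(artifact):
    return artifact["accuracy"] >= 0.92  # hypothetical promotion gate


registry = ToyRegistry()
v1 = registry.register({"accuracy": 0.90})
v2 = registry.register({"accuracy": 0.93})
registry.set_alias("champion", v1)

rejected = registry.promote(v1, [accurate_enough])  # fails the gate
promoted = registry.promote(v2, [accurate_enough])  # passes the gate
champion_after_promote = registry.aliases["champion"]

registry.set_alias("champion", v1)  # rollback: repoint the alias
```

Serving code that loads "whatever `champion` points at" picks up the promotion and the rollback without a redeploy, and the audit log records both moves.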

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
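As a rough illustration of tag-based attribution, the sketch below rolls tagged workload records up into cost per prediction for each team and model. The record fields, the figures, and the `cost_per_prediction` helper are all made up for the example; real billing exports will look different.

```python
from collections import defaultdict


def cost_per_prediction(workloads):
    """Aggregate cost by (team, model) tag and divide by predictions served.

    Returns None for a group with no predictions (for example, training only).
    """
    totals = defaultdict(lambda: {"cost": 0.0, "predictions": 0})
    for w in workloads:
        key = (w["team"], w["model"])
        totals[key]["cost"] += w["cost_usd"]
        totals[key]["predictions"] += w.get("predictions", 0)
    return {
        key: (t["cost"] / t["predictions"] if t["predictions"] else None)
        for key, t in totals.items()
    }


# Hypothetical tagged workloads for one billing period.
workloads = [
    {"team": "search", "model": "ranker", "kind": "training", "cost_usd": 40.0},
    {"team": "search", "model": "ranker", "kind": "endpoint", "cost_usd": 60.0,
     "predictions": 200_000},
    {"team": "fraud", "model": "scorer", "kind": "training", "cost_usd": 25.0},
]

unit_costs = cost_per_prediction(workloads)
```

Training cost is folded into the unit cost here, which is one possible convention; some teams track cost per training run separately instead.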
