datarekha
MLOps Easy

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

The short answer

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator that defines DAGs in Python. Kubeflow Pipelines is ML-native: it passes typed artifacts between containerized steps on Kubernetes and supports conditional logic, such as deploying only if accuracy exceeds a threshold. The choice depends on whether you need generic scheduling or ML-specific, container-based pipelines.

How to think about it

An ML workflow is a DAG of dependent steps: ingest, validate, featurize, train, evaluate, deploy. Cron only fires scripts on a clock, with no awareness of dependencies or failures. An orchestrator adds what cron lacks: dependency management, automatic retries, backfills, step caching, observability, and lineage.
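Backfills are the least self-explanatory item in that list, so here is a minimal sketch in plain Python. It assumes each run is keyed by the day of data it processes; `run_pipeline`, `backfill`, and the `completed` set are hypothetical names for illustration, not part of any orchestrator's API.

```python
from datetime import date, timedelta


def run_pipeline(logical_date, completed):
    """Hypothetical pipeline run, keyed by the day of data it processes."""
    # A real run would ingest, featurize, train, etc. for logical_date.
    # Here we only record that the run for that day finished.
    completed.add(logical_date)


def backfill(start, end, completed):
    """Run the pipeline for every day in [start, end] that has no completed run."""
    filled = []
    day = start
    while day <= end:
        if day not in completed:
            run_pipeline(day, completed)
            filled.append(day)
        day += timedelta(days=1)
    return filled


# Jan 2 and Jan 4 were missed (for example, the cron host was down).
completed = {date(2024, 1, 1), date(2024, 1, 3)}
filled = backfill(date(2024, 1, 1), date(2024, 1, 4), completed)
print(filled)
```

Because each run is tied to a date rather than to "whenever the script happened to fire", missed days can be filled in later without re-running the days that already succeeded.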

Why it matters

With chained cron jobs, if step 2 fails, step 3 still runs at its scheduled time on stale data, and you find out hours later. There’s no central view of what ran, no easy re-run of just the failed step, and no record of which run produced which model. Orchestrators track the state of every task, let you retry or re-run a single task, and surface the whole pipeline’s state; this works best when tasks are written to be atomic and idempotent. As Made With ML frames it, orchestration is what turns a pile of scripts into a reliable, repeatable system.
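The difference from cron can be sketched with a toy dependency-aware runner. This is an illustration built on the standard library's `graphlib`, not how Airflow or Kubeflow are implemented; `run_dag`, the step functions, and the retry count are all assumptions made for the example.

```python
from graphlib import TopologicalSorter


def run_dag(steps, deps, max_retries=2):
    """Run steps in dependency order and skip anything downstream of a failure.

    steps: dict mapping step name -> zero-argument callable
    deps:  dict mapping step name -> set of upstream step names
    Returns a dict mapping step name -> "success", "failed", or "skipped".
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        # Unlike cron, a step never starts unless every upstream step succeeded.
        if any(status[up] != "success" for up in deps.get(name, set())):
            status[name] = "skipped"
            continue
        for _attempt in range(1 + max_retries):
            try:
                steps[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"
    return status


attempts = {"ingest": 0, "featurize": 0}


def ingest():
    attempts["ingest"] += 1
    if attempts["ingest"] == 1:
        raise ConnectionError("transient network error")  # succeeds on retry


def featurize():
    attempts["featurize"] += 1
    raise ValueError("upstream table is missing")  # fails on every attempt


def train():
    pass


steps = {"ingest": ingest, "featurize": featurize, "train": train}
deps = {"ingest": set(), "featurize": {"ingest"}, "train": {"featurize"}}
status = run_dag(steps, deps)
print(status)
# {'ingest': 'success', 'featurize': 'failed', 'train': 'skipped'}
```

The returned `status` dict is a very small version of the "central view" an orchestrator provides: the transient failure was retried, the persistent failure is recorded, and `train` never ran on stale data.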

Airflow vs Kubeflow

  • Airflow is a general-purpose task orchestrator: define a DAG in Python, schedule anything (data engineering, ML, ETL). Reach for it when you want one tool across many task types.
  • Kubeflow Pipelines is ML-native: steps are containers running on Kubernetes, it passes typed artifacts between them, and it supports conditionals like “only deploy if eval accuracy > 0.9.” Resources are requested per step, so a training step can get GPUs while lighter steps run on CPU, which matters for heterogeneous workloads. It pays off when multiple teams share MLOps infra on a cluster.
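To make "typed artifacts" and the deployment conditional concrete, here is a plain-Python sketch of the idea. It deliberately does not use the Kubeflow Pipelines SDK: the `Dataset`, `Model`, and `Metrics` classes, the URIs, and the step functions are hypothetical stand-ins, and in a real pipeline each step would run in its own container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    uri: str
    rows: int


@dataclass(frozen=True)
class Model:
    uri: str
    trained_on: str  # URI of the Dataset it was trained on (lineage)


@dataclass(frozen=True)
class Metrics:
    accuracy: float


def train(data: Dataset) -> Model:
    # Placeholder: a real step would fit a model and write weights to storage.
    return Model(uri="s3://example-bucket/models/candidate", trained_on=data.uri)


def evaluate(model: Model, accuracy: float) -> Metrics:
    # Placeholder: accuracy is passed in so the example stays deterministic.
    return Metrics(accuracy=accuracy)


def should_deploy(metrics: Metrics, threshold: float = 0.9) -> bool:
    """Gate deployment on evaluation accuracy exceeding the threshold."""
    return metrics.accuracy > threshold


data = Dataset(uri="s3://example-bucket/data/2024-01-01", rows=10_000)
model = train(data)
deploy_strong = should_deploy(evaluate(model, accuracy=0.93))
deploy_weak = should_deploy(evaluate(model, accuracy=0.85))
print(model.trained_on, deploy_strong, deploy_weak)
```

The point of the types is that each step declares what it consumes and produces, so the pipeline can check the wiring and record which dataset produced which model, instead of steps exchanging untyped file paths.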

Concrete example

A daily retraining DAG: validate_data → train → evaluate → (if better than champion) register → canary. If validate_data detects a schema break, the orchestrator stops the run and alerts, so no garbage model gets trained, and after the fix you re-run just that branch instead of the whole pipeline.
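The control flow of that retraining run can be sketched as follows. The schema, the `simulated_accuracy` field, and the placeholder register and canary steps are assumptions made so the example is deterministic; a real pipeline would delegate each step to the orchestrator.

```python
EXPECTED_COLUMNS = {"user_id", "amount", "label"}  # hypothetical schema


class SchemaError(Exception):
    pass


def validate_data(batch):
    missing = EXPECTED_COLUMNS - set(batch["columns"])
    if missing:
        raise SchemaError(f"missing columns: {sorted(missing)}")


def train(batch):
    # Placeholder: returns a "model" whose accuracy is supplied by the batch.
    return {"name": "challenger", "accuracy": batch["simulated_accuracy"]}


def evaluate(model):
    return model["accuracy"]


def run_retraining(batch, champion_accuracy, alerts):
    """Return the names of the steps that ran for this batch."""
    ran = ["validate_data"]
    try:
        validate_data(batch)
    except SchemaError as err:
        alerts.append(f"validate_data failed: {err}")
        return ran  # stop here: nothing downstream runs
    model = train(batch)
    ran.append("train")
    accuracy = evaluate(model)
    ran.append("evaluate")
    if accuracy > champion_accuracy:
        ran += ["register", "canary"]  # placeholders for the real steps
    return ran


alerts = []
broken = {"columns": ["user_id", "amount"], "simulated_accuracy": 0.95}
fixed = {"columns": ["user_id", "amount", "label"], "simulated_accuracy": 0.95}

ran_broken = run_retraining(broken, champion_accuracy=0.92, alerts=alerts)
ran_fixed = run_retraining(fixed, champion_accuracy=0.92, alerts=alerts)
print(ran_broken)
print(ran_fixed)
print(alerts)
```

With the broken batch only `validate_data` runs and one alert is recorded; with the fixed batch the challenger beats the champion, so the run continues through register and canary.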

Common follow-up / trap

Interviewers ask: “Are Airflow, Kubeflow, and Ray the same?” No: they sit at different layers. Airflow is general workflow orchestration, Kubeflow Pipelines is ML pipelines on Kubernetes, and Ray is distributed compute. Conflating them signals shallow infra intuition. The other trap is defaulting to “just use Airflow” for everything; name the ML-native benefits (typed artifacts, container isolation, GPU control) that push you toward Kubeflow/ZenML/Metaflow.
