Orchestration: Airflow & DAGs

A real data platform is dozens of jobs with dependencies — extracts, transforms, loads, CDC sinks, reverse-ETL syncs. Why cron can't run them safely, how an orchestrator executes a DAG in dependency order with retries and backfills, and where Airflow, Dagster, and dbt fit.

8 min read Intermediate SQL Lesson 27 of 27

What you'll learn

Why cron is not enough — it knows time, not dependencies, failures, or backfills
The DAG model — tasks plus directed dependencies the orchestrator runs in order
Retries, backfills, and why every task must be idempotent
How a failure blocks downstream tasks instead of poisoning them with bad data
Where Airflow, Dagster, Prefect, and dbt fit in the modern stack

Before you start

ETL vs ELT SQL · lesson Change Data Capture (CDC) SQL · lesson Python section

You’ve now seen every piece of a data platform: data lands via ETL and CDC, gets modeled and stored columnar, and flows back out via reverse ETL. But a piece is not a pipeline. In production there are dozens of these jobs, and they depend on each other — you can’t load the warehouse before the transform finishes, and you can’t sync to Salesforce before the load. The system that runs all of them, in the right order, on a schedule, recovering from failures, is an orchestrator. It’s the conductor that turns a pile of jobs into a dependable pipeline.

Why not just cron?

The obvious idea is cron: “run the extract at 1 a.m., the transform at 2, the load at 3.” It falls apart fast, because cron knows exactly one thing — the time — and nothing about the work:

It has no concept of dependencies. If the 1 a.m. extract runs slow and finishes at 2:05, the 2 a.m. transform already fired on yesterday’s data.
It has no failure handling. If the extract crashes, cron cheerfully runs the transform and load anyway — now every downstream table is built from missing data, silently.
It can’t retry a flaky task, can’t backfill last week’s re-processing, and gives you no single view of what ran, what’s late, and what broke.

An orchestrator replaces “run at this time” with “run when its inputs are ready, retry if it fails, and tell me when something’s wrong.”

A pipeline is a DAG

The core abstraction is the DAG — a directed acyclic graph of tasks. You declare each task and what it depends on; the orchestrator figures out the rest: it runs tasks in a valid order, runs independent branches in parallel, and — crucially — when a task fails, it marks everything downstream as blocked rather than running it on bad or missing data.

(Heads-up on the word: this workflow DAG is a different DAG from Spark’s internal execution DAG. Same graph idea, different layer — one schedules jobs, the other plans one job’s operators.)

Trace the graph, then watch what happens the moment one task fails:

That blocking behavior is the heart of it. Without an orchestrator, a failed transform poisons the warehouse load silently. With one, the load never runs, the on-call engineer is paged, and yesterday’s good data stays untouched until the fix lands.

The capabilities that matter

Dependency-aware scheduling. Run on a schedule and only once inputs are ready (data-aware / sensor-based triggering).
Retries with backoff. Transient failures (a network blip, a rate limit) retry automatically — you saw the task recover on attempt 2.
Backfills. Re-run a task across a historical date range — reprocess all of last month after fixing a bug — using the same DAG.
Observability & alerting. One place to see every run’s status, logs, and timing, with alerts on failure or lateness (SLAs).
Parametrization. The same DAG runs for any logical date, which is what makes backfills and reruns clean.

The golden rule: idempotency

Retries and backfills both mean a task will run more than once for the same date. So every task must be idempotent — running it twice produces the same result as running it once. The standard technique is partition overwrite: a task for 2026-06-01 deletes-and-replaces that date’s partition rather than appending. Append-only tasks double-count on the first retry; idempotent tasks are safe to re-run forever. (This is the same idempotency idea you met in CDC — and it’s just as load-bearing here.)

The tools

Apache Airflow is the long-standing standard: DAGs written in Python, a rich operator/provider ecosystem, a mature scheduler and UI. If a job ad says “orchestration,” it usually means Airflow.
Dagster is the modern, asset-oriented challenger: you declare the data assets (tables) you want and their dependencies, with strong typing, testing, and lineage built in.
Prefect emphasizes a lightweight, Pythonic, dynamic flow API.
dbt isn’t a general orchestrator but builds a DAG of SQL transforms inside the warehouse — and is very often the “transform” step that a bigger Airflow/Dagster DAG triggers.
Managed options — MWAA (AWS), Cloud Composer (GCP), Astronomer — run Airflow for you.

Quick check

0/3

Q1Why is cron insufficient for running a multi-step data pipeline?

Q2In a DAG-based orchestrator, what happens to tasks downstream of one that fails (and exhausts retries)?

Q3TRANSFER: A daily task does INSERT INTO sales_daily SELECT ... WHERE date = '{{ ds }}'. After a transient failure it's retried and now that day appears twice. What's the root cause and the idempotent fix?

That completes the data-platform arc: model it, version it, store it columnar, stream it in with CDC, push it back out with reverse ETL, and orchestrate the whole thing as a DAG. From here, the natural next step is the engine that actually executes these transforms at scale — that’s the PySpark track, where you’ll meet a different DAG (Spark’s own execution plan) doing the heavy lifting inside a single job.

FAQCommon questions

Questions about this lesson

What is a data orchestrator?

A data orchestrator runs the many jobs of a data platform — extracts, transforms, loads, CDC sinks, reverse-ETL syncs — in the correct order, on a schedule, with retries, backfills, and alerting. Instead of firing jobs on fixed timers, it runs each task when its inputs are ready and blocks downstream tasks if an upstream one fails. Airflow, Dagster, and Prefect are the common tools.

Why is cron not enough to run a data pipeline?

Cron only knows the clock. It has no concept of dependencies, so it will run a transform even if the upstream extract failed or ran late; it cannot retry a flaky task or backfill history; and it gives no unified view of what ran or broke. The result is that a late or broken job silently corrupts everything scheduled after it. An orchestrator runs tasks on dependency-readiness and handles failure.

What is a DAG in Airflow?

A DAG (directed acyclic graph) defines the tasks in a pipeline and the dependencies between them. The orchestrator uses it to derive a valid execution order, run independent branches in parallel, and mark tasks downstream of a failure as blocked rather than running them on bad data. Note this workflow DAG is different from Spark's internal execution DAG — same graph idea, different layer.

Practice this in an interview

All questions

How does Apache Airflow work, and what is a DAG backfill?

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic like deploy only if accuracy exceeds a threshold. Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.

What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

What is the difference between ETL and ELT, and when should you choose each?

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.

Explore further

Glossary terms

Airflow Orchestration Data Pipeline Spark

Cheat sheets

SQL scikit-learn Pandas

Orchestration: Airflow & DAGs

What you'll learn

Before you start

Why not just cron?

A pipeline is a DAG

The capabilities that matter

The golden rule: idempotency

The tools

Quick check

Quick check

Next

Sign in to track your progress

Questions about this lesson

Practice this in an interview

Related lessons

Explore further