datarekha

Orchestration: Airflow & DAGs

A real data platform is dozens of jobs with dependencies — extracts, transforms, loads, CDC sinks, reverse-ETL syncs. Why cron can't run them safely, how an orchestrator executes a DAG in dependency order with retries and backfills, and where Airflow, Dagster, and dbt fit.

8 min read Intermediate SQL Lesson 27 of 27

What you'll learn

  • Why cron is not enough — it knows time, not dependencies, failures, or backfills
  • The DAG model — tasks plus directed dependencies the orchestrator runs in order
  • Retries, backfills, and why every task must be idempotent
  • How a failure blocks downstream tasks instead of poisoning them with bad data
  • Where Airflow, Dagster, Prefect, and dbt fit in the modern stack

Before you start

You’ve now seen every piece of a data platform: data lands via ETL and CDC, gets modeled and stored columnar, and flows back out via reverse ETL. But a piece is not a pipeline. In production there are dozens of these jobs, and they depend on each other — you can’t load the warehouse before the transform finishes, and you can’t sync to Salesforce before the load. The system that runs all of them, in the right order, on a schedule, recovering from failures, is an orchestrator. It’s the conductor that turns a pile of jobs into a dependable pipeline.

Why not just cron?

The obvious idea is cron: “run the extract at 1 a.m., the transform at 2, the load at 3.” It falls apart fast, because cron knows exactly one thing — the time — and nothing about the work:

  • It has no concept of dependencies. If the 1 a.m. extract runs slow and finishes at 2:05, the 2 a.m. transform already fired on yesterday’s data.
  • It has no failure handling. If the extract crashes, cron cheerfully runs the transform and load anyway — now every downstream table is built from missing data, silently.
  • It can’t retry a flaky task, can’t backfill last week’s re-processing, and gives you no single view of what ran, what’s late, and what broke.

An orchestrator replaces “run at this time” with “run when its inputs are ready, retry if it fails, and tell me when something’s wrong.”

A pipeline is a DAG

The core abstraction is the DAG — a directed acyclic graph of tasks. You declare each task and what it depends on; the orchestrator figures out the rest: it runs tasks in a valid order, runs independent branches in parallel, and — crucially — when a task fails, it marks everything downstream as blocked rather than running it on bad or missing data.

(Heads-up on the word: this workflow DAG is a different DAG from Spark’s internal execution DAG. Same graph idea, different layer — one schedules jobs, the other plans one job’s operators.)

Run the pipeline, then break it:

TryRun the DAG

A pipeline is a graph, not a script

Hit Run: each task starts only when its upstream tasks succeed, and the two extracts run together. Inject a failure and watch it block everything downstream — then turn on retries.

extract_ordersextract_userstransform_salesload_warehousequality_checksync_reverse_etl
pendingrunningsuccessfailedblocked

The graph defines dependencies, not a fixed sequence — the orchestrator derives the order, runs independent branches together, and reacts to failures.

That blocking behavior is the heart of it. Without an orchestrator, a failed transform poisons the warehouse load silently. With one, the load never runs, the on-call engineer is paged, and yesterday’s good data stays untouched until the fix lands.

The capabilities that matter

  • Dependency-aware scheduling. Run on a schedule and only once inputs are ready (data-aware / sensor-based triggering).
  • Retries with backoff. Transient failures (a network blip, a rate limit) retry automatically — you saw the task recover on attempt 2.
  • Backfills. Re-run a task across a historical date range — reprocess all of last month after fixing a bug — using the same DAG.
  • Observability & alerting. One place to see every run’s status, logs, and timing, with alerts on failure or lateness (SLAs).
  • Parametrization. The same DAG runs for any logical date, which is what makes backfills and reruns clean.

The golden rule: idempotency

Retries and backfills both mean a task will run more than once for the same date. So every task must be idempotent — running it twice produces the same result as running it once. The standard technique is partition overwrite: a task for 2026-06-01 deletes-and-replaces that date’s partition rather than appending. Append-only tasks double-count on the first retry; idempotent tasks are safe to re-run forever. (This is the same idempotency idea you met in CDC — and it’s just as load-bearing here.)

The tools

  • Apache Airflow is the long-standing standard: DAGs written in Python, a rich operator/provider ecosystem, a mature scheduler and UI. If a job ad says “orchestration,” it usually means Airflow.
  • Dagster is the modern, asset-oriented challenger: you declare the data assets (tables) you want and their dependencies, with strong typing, testing, and lineage built in.
  • Prefect emphasizes a lightweight, Pythonic, dynamic flow API.
  • dbt isn’t a general orchestrator but builds a DAG of SQL transforms inside the warehouse — and is very often the “transform” step that a bigger Airflow/Dagster DAG triggers.
  • Managed options — MWAA (AWS), Cloud Composer (GCP), Astronomer — run Airflow for you.

Quick check

Quick check

0/3
Q1Why is cron insufficient for running a multi-step data pipeline?
Q2In a DAG-based orchestrator, what happens to tasks downstream of one that fails (and exhausts retries)?
Q3TRANSFER: A daily task does INSERT INTO sales_daily SELECT ... WHERE date = '{{ ds }}'. After a transient failure it's retried and now that day appears twice. What's the root cause and the idempotent fix?

Next

That completes the data-platform arc: model it, version it, store it columnar, stream it in with CDC, push it back out with reverse ETL, and orchestrate the whole thing as a DAG. From here, the natural next step is the engine that actually executes these transforms at scale — that’s the PySpark track, where you’ll meet a different DAG (Spark’s own execution plan) doing the heavy lifting inside a single job.

FAQCommon questions

Questions about this lesson

What is a data orchestrator?

A data orchestrator runs the many jobs of a data platform — extracts, transforms, loads, CDC sinks, reverse-ETL syncs — in the correct order, on a schedule, with retries, backfills, and alerting. Instead of firing jobs on fixed timers, it runs each task when its inputs are ready and blocks downstream tasks if an upstream one fails. Airflow, Dagster, and Prefect are the common tools.

Why is cron not enough to run a data pipeline?

Cron only knows the clock. It has no concept of dependencies, so it will run a transform even if the upstream extract failed or ran late; it cannot retry a flaky task or backfill history; and it gives no unified view of what ran or broke. The result is that a late or broken job silently corrupts everything scheduled after it. An orchestrator runs tasks on dependency-readiness and handles failure.

What is a DAG in Airflow?

A DAG (directed acyclic graph) defines the tasks in a pipeline and the dependencies between them. The orchestrator uses it to derive a valid execution order, run independent branches in parallel, and mark tasks downstream of a failure as blocked rather than running them on bad data. Note this workflow DAG is different from Spark's internal execution DAG — same graph idea, different layer.

Practice this in an interview

All questions
How does Apache Airflow work, and what is a DAG backfill?

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.

What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

What is the difference between ETL and ELT, and when should you choose each?

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.

How do you handle schema evolution in data pipelines without breaking downstream consumers?

Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include: only adding nullable columns (backwards-compatible), using schema registries to enforce compatibility rules before a producer publishes, and open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content