datarekha
Data Engineering · Medium · Asked at: Airflow, Astronomer, Google, Amazon, Uber

How does Apache Airflow work, and what is a DAG backfill?

The short answer

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs according to each DAG's schedule (a cron expression or a preset such as @daily), passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, which lets you populate data for past periods after adding a new pipeline or fixing a bug. This is only safe when tasks are idempotent.

How to think about it

Airflow is one of the most widely used open-source orchestration platforms for batch pipelines. Understanding its scheduling model, especially the distinction between logical execution date and wall-clock time, is critical for writing correct, idempotent DAGs.

DAG anatomy

A DAG is defined in a Python file that declares tasks and their dependencies. Tasks are instances of operators, and they usually delegate the heavy data processing to external systems (SQL warehouses, Spark, HTTP services, etc.) rather than doing it inside Airflow.

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from datetime import datetime

with DAG(
    dag_id="orders_daily",
    schedule="@daily",              # runs once per day
    start_date=datetime(2026, 1, 1),
    catchup=True,                   # scheduler creates a run for every missed interval since start_date
) as dag:

    transform = SQLExecuteQueryOperator(
        task_id="transform_orders",
        conn_id="snowflake_prod",
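        sql=[
            # Delete then insert, so a rerun replaces only this day's rows.
            # In Snowflake, INSERT OVERWRITE would truncate the whole table.
            # (Column names order_date and total_amount are assumed.)
            "DELETE FROM orders_daily WHERE order_date = '{{ ds }}'",
            """
            INSERT INTO orders_daily (order_date, total_amount)
            SELECT DATE(created_at), SUM(amount)
            FROM orders
            WHERE DATE(created_at) = '{{ ds }}'  -- logical date, not now()
            GROUP BY DATE(created_at)
            """,
        ],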
    )

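{{ ds }} renders the run's logical date as YYYY-MM-DD: the start of the data interval the run covers, not the time the task executes. With interval-based scheduling (the default behavior in Airflow 2), a DAG scheduled @daily with start_date=2026-01-01 has ds=2026-01-01 for its first run, and that run actually executes just after midnight on January 2nd, once the interval has closed.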
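
The relationship between the logical date and the actual run time can be sketched in plain Python. This is only an illustration under the interval-based model described above, not Airflow's own code; the helper name daily_run_window is hypothetical.

```python
from datetime import datetime, timedelta

def daily_run_window(logical_date: datetime) -> dict:
    """Return the interval a daily run covers and the earliest time it can start."""
    interval_start = logical_date
    interval_end = logical_date + timedelta(days=1)
    return {
        "ds": interval_start.strftime("%Y-%m-%d"),  # what {{ ds }} would render as
        "data_interval_start": interval_start,
        "data_interval_end": interval_end,
        "earliest_run_time": interval_end,  # the run starts only after the interval closes
    }

first_run = daily_run_window(datetime(2026, 1, 1))
print(first_run["ds"])                 # 2026-01-01
print(first_run["earliest_run_time"])  # 2026-01-02 00:00:00
```

The run labelled 2026-01-01 processes January 1st's data, which is only complete once January 2nd has begun.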

Task dependencies

extract >> transform >> load_to_warehouse
# extract runs first; transform only starts when extract succeeds
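
To see why the graph must be acyclic, it can help to view the chain above as a dependency map and ask for a valid execution order. The sketch below uses Python's standard graphlib module rather than Airflow itself; it illustrates the ordering idea and is not how the Airflow scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load_to_warehouse": {"transform"},
}

# A topological order lists every task after all of its upstream tasks.
run_order = list(TopologicalSorter(upstream).static_order())
print(run_order)  # ['extract', 'transform', 'load_to_warehouse']
```

If the map contained a cycle, TopologicalSorter would raise graphlib.CycleError, which mirrors why a DAG cannot contain circular dependencies.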

Backfill

A backfill re-executes a DAG for a range of historical execution dates. Use cases:

  • A new pipeline added today needs historical data going back 12 months.
  • A bug in a transform was fixed; affected partitions need to be reprocessed.

# Airflow 2.x CLI syntax; Airflow 3 replaces this command with `airflow backfill create`
airflow dags backfill orders_daily \
  --start-date 2026-01-01 \
  --end-date   2026-05-31

Airflow creates one DAG run per scheduled interval in the date range and runs them (with configurable parallelism). Each run receives its own ds value, so if tasks use {{ ds }} and overwrite their partition, the backfill is safe and idempotent.
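
The idempotency argument can be checked with a small in-memory model. This is a minimal sketch, not Airflow code: the helpers logical_dates and run_partition are hypothetical, a dict stands in for the orders_daily table, and treating both ends of the date range as inclusive is an assumption of the sketch.

```python
from datetime import date, timedelta

def logical_dates(start: date, end: date):
    """Yield one logical date per daily interval, inclusive of both ends."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

def run_partition(table: dict, ds: str, source_rows: list) -> None:
    """Recompute one day's total and overwrite only that day's partition."""
    table[ds] = sum(amount for day, amount in source_rows if day == ds)

source_rows = [("2026-01-01", 10.0), ("2026-01-01", 5.0), ("2026-01-02", 7.5)]
orders_daily = {}

# Run the same backfill twice; the second pass overwrites, never appends.
for _ in range(2):
    for d in logical_dates(date(2026, 1, 1), date(2026, 1, 3)):
        run_partition(orders_daily, d.isoformat(), source_rows)

print(orders_daily)
```

Because each run writes only the partition keyed by its own ds, repeating the backfill leaves the table unchanged. An append-style insert would instead double the totals on the second pass.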

Key configuration choices

Setting            Effect
catchup=True       Airflow automatically backfills missed runs from start_date to now when a DAG is first turned on
max_active_runs    Limits how many DAG runs execute in parallel during a backfill
retries            Number of automatic retries per task on failure
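
The retries setting counts retries after the first failed attempt, so retries=2 allows up to three attempts in total. The plain-Python sketch below illustrates that counting only; run_with_retries and flaky_task are hypothetical names, and real Airflow retries also involve a delay between attempts, which is omitted here.

```python
def run_with_retries(task, retries: int):
    """Call task(); on failure, retry up to `retries` more times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # retry budget exhausted: surface the failure

calls = {"n": 0}

def flaky_task():
    """Fail on the first two calls, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, attempts = run_with_retries(flaky_task, retries=2)
print(result, attempts)  # ok 3
```

Retrying is another reason tasks should be idempotent: a task that fails partway through and is retried must not leave duplicated output behind.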