How does Apache Airflow work, and what is a DAG backfill?
Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler creates DAG runs according to the DAG's schedule (a cron expression, a preset such as @daily, or a custom timetable) and passes each run a logical date that identifies the data interval it covers, not the wall-clock time at which it executes. A backfill re-runs a DAG over a historical date range, which lets you populate past periods after adding a new pipeline or fixing a bug, provided the tasks are idempotent.
How to think about it
Airflow is a widely used open-source orchestration platform for batch pipelines. Understanding its scheduling model, especially the distinction between the logical date and wall-clock time, is critical for writing correct, idempotent DAGs.
DAG anatomy
A DAG is defined in a Python file that declares tasks and their dependencies. Tasks usually do not process data themselves; each one is an instance of an operator that delegates the work to an external system (SQL, Spark, HTTP, etc.).
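from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="orders_daily",
    schedule="@daily",  # one run per day
    start_date=datetime(2026, 1, 1),
    catchup=True,  # create runs for every missed interval since start_date
) as dag:
    transform = SQLExecuteQueryOperator(
        task_id="transform_orders",
        conn_id="snowflake_prod",
        # Delete-then-insert replaces only this run's day, so reruns are idempotent.
        # In Snowflake, INSERT OVERWRITE would truncate the whole table instead.
        # The column names order_date and total_amount are assumed for illustration.
        sql=[
            "DELETE FROM orders_daily WHERE order_date = '{{ ds }}'",
            """
            INSERT INTO orders_daily (order_date, total_amount)
            SELECT DATE(created_at), SUM(amount)
            FROM orders
            WHERE DATE(created_at) = '{{ ds }}'  -- logical date, not now()
            GROUP BY DATE(created_at)
            """,
        ],
    )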
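{{ ds }} renders the run's logical date as YYYY-MM-DD: the date the data interval represents, not the date the task runs. A DAG scheduled @daily with start_date=2026-01-01 has ds=2026-01-01 for its first run, even though that run executes only after the interval closes, at or after midnight on January 2nd. This is the interval-based behaviour that Airflow 2.x uses by default; newer versions can be configured so that the logical date equals the trigger time.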
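The gap between the logical date and the earliest possible start time can be sketched in plain Python. This is an illustration rather than Airflow's scheduler code: daily_run_window is a hypothetical helper, it assumes the interval-based daily scheduling described above, and it ignores time zones.

```python
from datetime import date, datetime, time, timedelta


def daily_run_window(logical_date: date) -> tuple[datetime, datetime]:
    """Return (interval_start, interval_end) for a daily run.

    Hypothetical helper for illustration only. A daily run covers
    [logical_date 00:00, next day 00:00) and can start only once
    that interval has ended. Naive datetimes are used for brevity.
    """
    interval_start = datetime.combine(logical_date, time.min)
    interval_end = interval_start + timedelta(days=1)
    return interval_start, interval_end


start, end = daily_run_window(date(2026, 1, 1))
ds = start.strftime("%Y-%m-%d")  # the string {{ ds }} would render to
print(ds, "-> earliest start:", end.isoformat())
# prints: 2026-01-01 -> earliest start: 2026-01-02T00:00:00
```

The run labelled 2026-01-01 cannot begin before 2026-01-02T00:00, which is why SQL that filters on now() instead of the logical date selects the wrong day.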
Task dependencies
extract >> transform >> load_to_warehouse
# extract runs first; transform only starts when extract succeeds
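To see the ordering that such a chain implies, here is a toy sketch that uses the standard library's graphlib module rather than Airflow itself. The task names come from the line above; the dictionary representation is an assumption made for illustration, and Airflow's real scheduler additionally tracks task state, retries, and trigger rules.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# This mirrors: extract >> transform >> load_to_warehouse
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load_to_warehouse": {"transform"},
}

# static_order() yields tasks so that each one comes after its upstream tasks.
execution_order = list(TopologicalSorter(upstream).static_order())
print(execution_order)
# prints: ['extract', 'transform', 'load_to_warehouse']
```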
Backfill
A backfill re-executes a DAG for a range of historical execution dates. Use cases:
- A new pipeline added today needs historical data going back 12 months.
- A bug in a transform was fixed; affected partitions need to be reprocessed.
airflow dags backfill orders_daily \
--start-date 2026-01-01 \
--end-date 2026-05-31
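Airflow creates one DAG run per scheduled interval in the date range and runs them with configurable parallelism. Each run receives its own ds value, so if tasks filter on {{ ds }} and overwrite only their own partition, the backfill is idempotent and safe to repeat. The command shown uses the Airflow 2.x CLI; in Airflow 3 backfills are started with airflow backfill create instead.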
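An in-memory stand-in for the warehouse shows why overwriting a partition keyed by the logical date makes a backfill safe to repeat. Everything here is hypothetical scaffolding (the orders rows, the orders_daily dictionary, and the run_for and backfill functions); it sketches the idempotency property, not Airflow's backfill machinery.

```python
from datetime import date, timedelta

# Hypothetical source rows: (created_date, amount)
orders = [
    (date(2026, 1, 1), 10.0),
    (date(2026, 1, 1), 5.0),
    (date(2026, 1, 2), 7.5),
]

# Stand-in for the target table, keyed by the ds string of each run.
orders_daily: dict[str, float] = {}


def run_for(ds: str) -> None:
    """Recompute one day's total and overwrite that day's partition only."""
    total = sum(amount for created, amount in orders if created.isoformat() == ds)
    orders_daily[ds] = total  # overwrite, never append


def backfill(start: date, end: date) -> None:
    """Run once per day in [start, end], like one DAG run per interval."""
    day = start
    while day <= end:
        run_for(day.isoformat())
        day += timedelta(days=1)


backfill(date(2026, 1, 1), date(2026, 1, 3))
first_pass = dict(orders_daily)
backfill(date(2026, 1, 1), date(2026, 1, 3))  # re-running changes nothing
print(orders_daily == first_pass)
# prints: True
```

If run_for appended rows instead of overwriting, the second pass would double every total, which is the failure mode idempotent tasks avoid.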
Key configuration choices
| Setting | Effect |
|---|---|
| catchup=True | Airflow automatically backfills missed runs from start_date to now when a DAG is first turned on |
| max_active_runs | Limits how many DAG runs execute in parallel during a backfill |
| retries | Number of automatic retries per task on failure |
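The effect of catchup can be sketched by listing the logical dates a daily schedule would produce when a DAG is first enabled. pending_logical_dates is a hypothetical helper, not an Airflow API, and it assumes a daily schedule in which a run is created only after its interval has fully elapsed. With catchup disabled it keeps only the most recent completed interval.

```python
from datetime import date, timedelta


def pending_logical_dates(start_date: date, today: date, catchup: bool) -> list[date]:
    """Logical dates of the daily runs due when the DAG is enabled on `today`."""
    completed = []
    day = start_date
    # The interval for a given day is complete once the following day has begun.
    while day + timedelta(days=1) <= today:
        completed.append(day)
        day += timedelta(days=1)
    # With catchup off, only the latest completed interval gets a run.
    return completed if catchup else completed[-1:]


with_catchup = pending_logical_dates(date(2026, 1, 1), date(2026, 1, 11), catchup=True)
without_catchup = pending_logical_dates(date(2026, 1, 1), date(2026, 1, 11), catchup=False)
print(len(with_catchup))   # prints: 10
print(without_catchup[0])  # prints: 2026-01-10
```

Ten queued runs are harmless for a small DAG, but a start_date years in the past can queue thousands, which is where max_active_runs becomes important.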