What does it mean for a pipeline task to be idempotent, and why does it matter for backfills and retries?
An idempotent task produces the same result whether it runs once or many times, typically by writing to a deterministic partition and overwriting rather than appending. This matters because orchestrators retry failed tasks and run backfills over historical dates, and non-idempotent tasks would double-count or corrupt data on re-runs. Designing tasks to be idempotent and partitioned by execution date makes retries and backfills safe and reproducible.
How to think about it
The short answer
A task is idempotent if running it once or ten times yields the same end state. You achieve it by writing to a deterministic, date-partitioned location and overwriting rather than appending. It matters because orchestrators retry failures and backfill historical dates — and a non-idempotent task double-counts or corrupts data every time it re-runs.
Why
Failures are normal: a node dies, an API times out, the orchestrator retries the task. If the task INSERTs rows, the retry inserts them again and the table ends up with duplicates. Backfilling (re-running the pipeline for, say, all of last month) amplifies this across many dates. Idempotency makes retries and backfills safe and reproducible, which is why it is a core data-engineering practice for orchestrating ML pipelines.
How to make a task idempotent
- Partition by execution date and write to output/dt=2026-06-10/, overwriting that partition on each run.
- Prefer upsert/overwrite semantics over blind append.
- Make the task a pure function of its inputs and the run’s logical date; never use now() or a mutable counter.
- Make side effects (model registration, notifications) keyed so a retry replaces rather than duplicates.
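The first two points can be sketched on a local filesystem. This is a minimal illustration, not a production writer: the function name write_partition, the file name part-0000.jsonl and the JSON Lines format are assumptions made for the example, and the delete-then-rename step is not strictly atomic (an object store or table format would need its own overwrite mechanism).

```python
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def write_partition(base_dir: Path, logical_date: date, rows: list) -> Path:
    """Overwrite the dt=<logical_date> partition so it holds exactly `rows`."""
    base_dir.mkdir(parents=True, exist_ok=True)
    # The target path depends only on the logical date, never on wall-clock time.
    partition = base_dir / f"dt={logical_date.isoformat()}"

    # Stage in a sibling temp directory so a crash mid-write leaves the
    # previously published partition untouched.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=str(base_dir)))
    try:
        with open(staging / "part-0000.jsonl", "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, sort_keys=True) + "\n")
        # Replace rather than append: drop the old partition, publish the new one.
        if partition.exists():
            shutil.rmtree(partition)
        staging.rename(partition)
    finally:
        # Clean up only if the rename did not happen (i.e. the write failed).
        if staging.exists():
            shutil.rmtree(staging)
    return partition
```

Calling write_partition twice with the same date and rows leaves one directory with one file and the same contents, which is the property a retry relies on.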
Concrete example
A feature job for dt=2026-06-10 recomputes that day’s aggregates and overwrites the partition. If it fails halfway and retries, the final partition is correct — no duplicated features. When a bug is fixed, you backfill 30 days and each day cleanly overwrites, giving identical results to a fresh run.
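The scenario above can be simulated in memory to make the difference visible. Everything here is hypothetical and chosen for illustration: the EVENTS data, the job functions, and the crash_after parameter that fakes a mid-run failure. A real job would read from and write to actual storage.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical raw events as (event_date, user_id) pairs.
EVENTS = [
    (date(2026, 6, 9), "a"),
    (date(2026, 6, 10), "a"),
    (date(2026, 6, 10), "a"),
    (date(2026, 6, 10), "b"),
]


def compute_features(logical_date):
    """Per-user event counts for one day, sorted by user for stable output."""
    counts = Counter(user for day, user in EVENTS if day == logical_date)
    return sorted(counts.items())


def append_job(table, logical_date, crash_after=None):
    """Non-idempotent: rows written before a crash stay in the table."""
    for i, (user, n) in enumerate(compute_features(logical_date)):
        if i == crash_after:
            raise RuntimeError("simulated mid-run failure")
        table.append((logical_date, user, n))


def overwrite_job(partitions, logical_date, crash_after=None):
    """Idempotent: rows are staged, then the day's partition is replaced."""
    staged = []
    for i, (user, n) in enumerate(compute_features(logical_date)):
        if i == crash_after:
            raise RuntimeError("simulated mid-run failure")
        staged.append((user, n))
    partitions[logical_date] = staged  # published only once complete


def run_with_retry(job, target, logical_date):
    """First attempt dies after one row; the retry runs to completion."""
    try:
        job(target, logical_date, crash_after=1)
    except RuntimeError:
        job(target, logical_date)


def backfill(partitions, start, days):
    """Re-run the idempotent job for `days` consecutive dates."""
    for offset in range(days):
        overwrite_job(partitions, start + timedelta(days=offset))


day = date(2026, 6, 10)
table, partitions = [], {}
run_with_retry(append_job, table, day)
run_with_retry(overwrite_job, partitions, day)
print(len(table))       # 3: the row for user "a" was written twice
print(partitions[day])  # [('a', 2), ('b', 1)]
```

Running backfill twice over the same 30 days leaves the partitions dictionary unchanged after the second pass, matching the claim that a backfill gives the same result as a fresh run.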
Common follow-up / trap
A classic trap is using datetime.now() inside a task instead of the orchestrator’s logical/execution date. This breaks backfills because re-running a past date computes against today’s date rather than the date being reprocessed. Interviewers also probe: “What about a model that registers itself each run?” The fix is keying the registration to the run so a retry updates the same version rather than creating duplicates.
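Both follow-ups can be shown in a few lines. This is a sketch under assumptions: window_bad, window_good and the in-memory ModelRegistry class are hypothetical names invented for the example, not any real registry's API, and how the logical date reaches the task depends on the orchestrator.

```python
from datetime import date, datetime, timedelta


def window_bad():
    # Trap: wall-clock time, so a backfill for a past date still gets today's window.
    today = datetime.now().date()
    return today - timedelta(days=1), today


def window_good(logical_date):
    # Derived only from the orchestrator-supplied logical date.
    return logical_date - timedelta(days=1), logical_date


class ModelRegistry:
    """Hypothetical in-memory registry keyed by (model name, run key)."""

    def __init__(self):
        self._versions = {}

    def register(self, name, run_key, metadata):
        # Same key on a retry: the entry is replaced, not duplicated.
        self._versions[(name, run_key)] = metadata

    def count(self, name):
        return sum(1 for model, _ in self._versions if model == name)


registry = ModelRegistry()
run_key = "dt=2026-06-10"  # assumed key format: one version per logical date
registry.register("churn", run_key, {"auc": 0.91})
registry.register("churn", run_key, {"auc": 0.91})  # retry of the same run
print(registry.count("churn"))          # 1
print(window_good(date(2026, 6, 10)))   # (datetime.date(2026, 6, 9), datetime.date(2026, 6, 10))
```

Keying on the logical date is one design choice; keying on an orchestrator run ID would also work, as long as a retry of the same run reuses the same key.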