Change Data Capture (CDC)
How modern pipelines keep a warehouse fresh to the second — by streaming every insert, update, and delete out of a source database the instant it commits. Log-based vs query-based capture, why only the log sees deletes, Debezium and the WAL/binlog, and feeding SCD Type 2.
What you'll learn
- What a change event is — operation, before/after image, and commit order
- Query-based capture (polling) vs log-based capture (reading the transaction log)
- Why polling is blind to deletes and intermediate states — and the log isn't
- How Debezium reads the Postgres WAL / MySQL binlog and publishes change streams
- Where CDC feeds — zero-ETL replication, SCD Type 2 via MERGE, and event-driven systems
Before you start
A nightly batch ETL job answers the question “what did the business look like yesterday?” For a growing list of use cases — fraud checks, live dashboards, keeping a search index or cache in sync — yesterday is far too old. Change Data Capture (CDC) is how you close the gap: instead of re-copying a whole table on a schedule, you capture every individual change — each insert, update, and delete — the moment it commits, and stream it onward. The warehouse stops being a stale snapshot and becomes a near-real-time mirror.
A change is a first-class event
The core mental shift: a row change is data. Each CDC event carries the
operation (create / update / delete), the after-image (the
new row), usually a before-image (the old row, for updates and
deletes), and a position/timestamp (LSN, offset) that fixes its place in
commit order. A consumer that replays this stream in order reconstructs
the source table exactly — and can also react to each change.
Two ways to capture changes
There are two families, and the difference is the whole lesson.
Query-based (polling). Add an updated_at column, and every so often
run SELECT * WHERE updated_at > :last_seen. Dead simple, needs no special
database access. But it has holes you can’t patch: a DELETE leaves no
updated_at to find, so deletes are invisible; you only ever see the
latest state between polls, missing intermediate changes; and frequent
polling hammers the source while still adding latency.
Log-based. Every transactional database already writes an ordered, durable record of every change — the write-ahead log (Postgres WAL, MySQL binlog, Oracle redo, SQL Server transaction log) — because that’s how it guarantees durability and replication. Log-based CDC reads that log and turns it into a change stream. It sees everything, including deletes and every intermediate update, in exact commit order, with minimal load on the source (reading the log is cheap; the database already wrote it).
Step through both and watch the target drift:
Stream changes from a source DB to the warehouse
Apply commits one at a time. Log-based CDC reads the transaction log and captures every operation; query-based polling misses deletes. Watch the target drift.
| id | cust | status | total |
|---|---|---|---|
| 101 | Ivo | paid | 40 |
| 102 | Lena | new | 90 |
| 103 | Sun | paid | 25 |
| id | cust | status | total |
|---|---|---|---|
| 101 | Ivo | paid | 40 |
| 102 | Lena | new | 90 |
| 103 | Sun | paid | 25 |
Log-based: the target is an exact mirror — inserts, updates, and deletes all flow through the transaction log in commit order.
The log-based stack in practice
The canonical open-source tool is Debezium. It connects to a database’s replication interface — Postgres logical decoding, MySQL’s binlog replication protocol — decodes the log into structured change events, and publishes them, most often to Apache Kafka. Downstream, sink connectors land those events in a warehouse, a lake, Elasticsearch, or another service. The managed cloud equivalents (AWS DMS, Fivetran, GCP Datastream, and the zero-ETL integrations you met in the OLTP-vs-OLAP lesson) wrap the same log-reading idea behind a UI.
A real pipeline also handles the initial snapshot: before it can stream changes, it reads the table’s current contents once, then switches to tailing the log from the exact LSN where the snapshot ended — so nothing is missed or double-counted at the boundary.
What CDC feeds
- Warehouse / lakehouse replication. Keep an analytical copy continuously in sync instead of nightly reloads — the modern, low-latency face of ELT.
- SCD Type 2. The change stream is the input to the dimension-history
MERGEyou saw last lesson: each captured update expires the old row and inserts a new version. - Event-driven systems. Invalidate caches, update search indexes, or emit domain events (the transactional outbox pattern) — all driven off the same committed changes, with no dual-write inconsistency.
The hard parts
CDC isn’t free. Streams can deliver a change more than once after a retry, so consumers must be idempotent — apply by primary key with an upsert so a replayed event is a no-op. Ordering must be preserved per key (Kafka does this by partitioning on the key). Schema changes at the source (a new column, a dropped one) have to propagate without breaking consumers. And you need monitoring for replication lag — if the consumer falls behind the log, freshness silently degrades.
Quick check
Quick check
Next
CDC streams data into the warehouse, where it’s modeled, versioned, and stored columnar for analytics. But the warehouse isn’t the last stop. Often the insights it produces — a churn score, a customer segment, a fresh lead list — need to travel back out to the tools where people work. That’s reverse ETL, next.
Questions about this lesson
What is Change Data Capture (CDC)?
CDC captures every individual change — insert, update, and delete — made to a source database and streams it onward the moment it commits, instead of re-copying whole tables on a schedule. Each change event carries the operation, the new (and usually old) row image, and a position in commit order, so a consumer can keep a warehouse or other system continuously in sync.
What is the difference between log-based and query-based CDC?
Query-based CDC polls the source with something like WHERE updated_at > last_seen — simple, but blind to deletes and to intermediate states between polls. Log-based CDC reads the database's transaction log (Postgres WAL, MySQL binlog), so it captures every insert, update, and delete in exact commit order with minimal load. Log-based is the robust choice; Debezium is the common tool.
Why can't query-based CDC capture deletes?
Query-based polling finds rows whose updated_at timestamp advanced. A DELETE removes the row entirely and leaves no timestamp behind, so the next poll has nothing to match and never sees the row disappear. Only reading the transaction log (or using soft-delete tombstones) captures deletions — which is why deleted rows linger as stale ghosts under polling.
Practice this in an interview
All questionsCDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.
Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include: only adding nullable columns (backwards-compatible), using schema registries to enforce compatibility rules before a producer publishes, and open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.
An idempotent pipeline produces the same output no matter how many times it runs for the same logical window — rerunning it on an already-processed date partition yields identical results rather than duplicated rows. Achieving idempotency typically means using INSERT OVERWRITE (or MERGE) instead of plain INSERT, keying every record with a deterministic ID, and deleting-then-inserting the target partition before writing.