datarekha

Change Data Capture (CDC)

How modern pipelines keep a warehouse fresh to the second — by streaming every insert, update, and delete out of a source database the instant it commits. Log-based vs query-based capture, why only the log sees deletes, Debezium and the WAL/binlog, and feeding SCD Type 2.

9 min read Intermediate SQL Lesson 25 of 27

What you'll learn

  • What a change event is — operation, before/after image, and commit order
  • Query-based capture (polling) vs log-based capture (reading the transaction log)
  • Why polling is blind to deletes and intermediate states — and the log isn't
  • How Debezium reads the Postgres WAL / MySQL binlog and publishes change streams
  • Where CDC feeds — zero-ETL replication, SCD Type 2 via MERGE, and event-driven systems

Before you start

A nightly batch ETL job answers the question “what did the business look like yesterday?” For a growing list of use cases — fraud checks, live dashboards, keeping a search index or cache in sync — yesterday is far too old. Change Data Capture (CDC) is how you close the gap: instead of re-copying a whole table on a schedule, you capture every individual change — each insert, update, and delete — the moment it commits, and stream it onward. The warehouse stops being a stale snapshot and becomes a near-real-time mirror.

A change is a first-class event

The core mental shift: a row change is data. Each CDC event carries the operation (create / update / delete), the after-image (the new row), usually a before-image (the old row, for updates and deletes), and a position/timestamp (LSN, offset) that fixes its place in commit order. A consumer that replays this stream in order reconstructs the source table exactly — and can also react to each change.

Two ways to capture changes

There are two families, and the difference is the whole lesson.

Query-based (polling). Add an updated_at column, and every so often run SELECT * WHERE updated_at > :last_seen. Dead simple, needs no special database access. But it has holes you can’t patch: a DELETE leaves no updated_at to find, so deletes are invisible; you only ever see the latest state between polls, missing intermediate changes; and frequent polling hammers the source while still adding latency.

Log-based. Every transactional database already writes an ordered, durable record of every change — the write-ahead log (Postgres WAL, MySQL binlog, Oracle redo, SQL Server transaction log) — because that’s how it guarantees durability and replication. Log-based CDC reads that log and turns it into a change stream. It sees everything, including deletes and every intermediate update, in exact commit order, with minimal load on the source (reading the log is cheap; the database already wrote it).

Step through both and watch the target drift:

TryCDC stream

Stream changes from a source DB to the warehouse

Apply commits one at a time. Log-based CDC reads the transaction log and captures every operation; query-based polling misses deletes. Watch the target drift.

source — ordersOLTP
idcuststatustotal
101Ivopaid40
102Lenanew90
103Sunpaid25
change logWAL tail
no events yet
target — warehouseOLAP
idcuststatustotal
101Ivopaid40
102Lenanew90
103Sunpaid25

Log-based: the target is an exact mirror — inserts, updates, and deletes all flow through the transaction log in commit order.

The log-based stack in practice

The canonical open-source tool is Debezium. It connects to a database’s replication interface — Postgres logical decoding, MySQL’s binlog replication protocol — decodes the log into structured change events, and publishes them, most often to Apache Kafka. Downstream, sink connectors land those events in a warehouse, a lake, Elasticsearch, or another service. The managed cloud equivalents (AWS DMS, Fivetran, GCP Datastream, and the zero-ETL integrations you met in the OLTP-vs-OLAP lesson) wrap the same log-reading idea behind a UI.

A real pipeline also handles the initial snapshot: before it can stream changes, it reads the table’s current contents once, then switches to tailing the log from the exact LSN where the snapshot ended — so nothing is missed or double-counted at the boundary.

What CDC feeds

  • Warehouse / lakehouse replication. Keep an analytical copy continuously in sync instead of nightly reloads — the modern, low-latency face of ELT.
  • SCD Type 2. The change stream is the input to the dimension-history MERGE you saw last lesson: each captured update expires the old row and inserts a new version.
  • Event-driven systems. Invalidate caches, update search indexes, or emit domain events (the transactional outbox pattern) — all driven off the same committed changes, with no dual-write inconsistency.

The hard parts

CDC isn’t free. Streams can deliver a change more than once after a retry, so consumers must be idempotent — apply by primary key with an upsert so a replayed event is a no-op. Ordering must be preserved per key (Kafka does this by partitioning on the key). Schema changes at the source (a new column, a dropped one) have to propagate without breaking consumers. And you need monitoring for replication lag — if the consumer falls behind the log, freshness silently degrades.

Quick check

Quick check

0/3
Q1Why is query-based CDC (polling WHERE updated_at > last_seen) fundamentally unable to capture deletes?
Q2What does log-based CDC read, and why does that make it both complete and low-overhead?
Q3TRANSFER: Your CDC consumer occasionally receives the same change event twice after a network retry. What design property prevents this from corrupting the target, and how?

Next

CDC streams data into the warehouse, where it’s modeled, versioned, and stored columnar for analytics. But the warehouse isn’t the last stop. Often the insights it produces — a churn score, a customer segment, a fresh lead list — need to travel back out to the tools where people work. That’s reverse ETL, next.

FAQCommon questions

Questions about this lesson

What is Change Data Capture (CDC)?

CDC captures every individual change — insert, update, and delete — made to a source database and streams it onward the moment it commits, instead of re-copying whole tables on a schedule. Each change event carries the operation, the new (and usually old) row image, and a position in commit order, so a consumer can keep a warehouse or other system continuously in sync.

What is the difference between log-based and query-based CDC?

Query-based CDC polls the source with something like WHERE updated_at > last_seen — simple, but blind to deletes and to intermediate states between polls. Log-based CDC reads the database's transaction log (Postgres WAL, MySQL binlog), so it captures every insert, update, and delete in exact commit order with minimal load. Log-based is the robust choice; Debezium is the common tool.

Why can't query-based CDC capture deletes?

Query-based polling finds rows whose updated_at timestamp advanced. A DELETE removes the row entirely and leaves no timestamp behind, so the next poll has nothing to match and never sees the row disappear. Only reading the transaction log (or using soft-delete tombstones) captures deletions — which is why deleted rows linger as stale ghosts under polling.

Practice this in an interview

All questions
What is Change Data Capture (CDC) and how is it implemented?

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.

What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

How do you handle schema evolution in data pipelines without breaking downstream consumers?

Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include: only adding nullable columns (backwards-compatible), using schema registries to enforce compatibility rules before a producer publishes, and open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.

What does idempotency mean for a data pipeline, and how do you make a pipeline idempotent?

An idempotent pipeline produces the same output no matter how many times it runs for the same logical window — rerunning it on an already-processed date partition yields identical results rather than duplicated rows. Achieving idempotency typically means using INSERT OVERWRITE (or MERGE) instead of plain INSERT, keying every record with a deterministic ID, and deleting-then-inserting the target partition before writing.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content