Change Data Capture (CDC)

What you'll learn

What a change event is — operation, before/after image, and commit order

Query-based capture (polling) versus log-based capture (reading the transaction log)

Why polling is blind to deletes — and the log isn't

Where CDC feeds: warehouse replication, SCD Type 2, and event-driven systems

A nightly batch job answers “what did the business look like yesterday?” For a growing list of jobs — fraud checks, live dashboards, keeping a search index in sync — yesterday is far too old.

Change Data Capture closes that gap. Instead of re-copying a whole table on a schedule, it captures every individual change (each insert, update, and delete) as soon as it commits, and streams it onward. The warehouse stops being a stale snapshot and becomes a near-real-time mirror. The mental shift is that a row change is itself data: each CDC event carries the operation (create / update / delete), the new row, usually the old row too, and a position that fixes its place in commit order. Replay the full stream in that order and you reconstruct the source table's current state exactly.
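To make the shape of a change event concrete, here is a minimal sketch in Python. The ChangeEvent class, its field names, and the "c" / "u" / "d" operation codes are illustrative assumptions, not the format of any particular tool. The point is only that applying events in position order rebuilds the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    # All field names here are hypothetical, chosen for illustration.
    position: int           # place in commit order (stand-in for a log position)
    op: str                 # "c" = create, "u" = update, "d" = delete
    key: int                # primary key of the affected row
    before: Optional[dict]  # old row image (None for a create)
    after: Optional[dict]   # new row image (None for a delete)


def replay(events):
    """Rebuild table state by applying events in commit order."""
    table = {}
    for ev in sorted(events, key=lambda e: e.position):
        if ev.op == "d":
            table.pop(ev.key, None)   # a delete removes the row
        else:
            table[ev.key] = ev.after  # create/update installs the new image
    return table


events = [
    ChangeEvent(1, "c", 1, None, {"id": 1, "status": "new"}),
    ChangeEvent(2, "c", 2, None, {"id": 2, "status": "new"}),
    ChangeEvent(3, "u", 1, {"id": 1, "status": "new"}, {"id": 1, "status": "paid"}),
    ChangeEvent(4, "d", 2, {"id": 2, "status": "new"}, None),
]

mirror = replay(events)
print(mirror)
```

Row 2 is created and later deleted, so it is absent from the rebuilt table, and row 1 ends in its last committed state. Sorting by position is what makes the result independent of the order in which events happen to arrive.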

Two ways to capture, and one decisive difference

There are two families of CDC, and the gap between them is the whole lesson.

Query-based (polling) capture relies on an updated_at column and periodically runs a query like SELECT * FROM <table> WHERE updated_at > :last_seen. It is dead simple and needs no special access, but it has a hole you cannot patch: a DELETE removes the row entirely, leaving no updated_at for the next query to find, so deletes are invisible. It also sees only the latest state of each row at poll time, so any intermediate changes between two polls are lost.

Log-based capture instead reads the database's transaction log (Postgres WAL, MySQL binlog), the ordered, durable record every transactional database already writes for crash recovery, and turns it into a change stream. It sees every committed insert, update, and delete in exact commit order, and it adds little load to the source because the database writes that log anyway.

A deleted row leaves nothing for a poller to find. Only the transaction log records the removal.
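The blind spot can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module with a hypothetical orders table and integer timestamps. The table, the column names, and the poll helper are assumptions made for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)


def poll(conn, last_seen):
    """Return rows changed since last_seen and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = max((r[2] for r in rows), default=last_seen)
    return rows, new_mark


replica = {}   # the downstream copy maintained by polling
last_seen = 0

# Two inserts, then the first poll picks both up.
conn.execute("INSERT INTO orders VALUES (1, 'new', 1)")
conn.execute("INSERT INTO orders VALUES (2, 'new', 2)")
rows1, last_seen = poll(conn, last_seen)
for id_, status, _ in rows1:
    replica[id_] = status

# Between polls: row 1 changes twice, row 2 is deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 3 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 4 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
rows2, last_seen = poll(conn, last_seen)
for id_, status, _ in rows2:
    replica[id_] = status

source = dict(conn.execute("SELECT id, status FROM orders").fetchall())
print("second poll saw:", rows2)
print("source: ", source)
print("replica:", replica)
```

The second poll returns only the final state of row 1: the intermediate 'paid' state never appears, and nothing signals that row 2 is gone, so the replica keeps it as a stale row.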

The canonical tool is Debezium: it reads changes through Postgres logical decoding or from the MySQL binlog, turns them into structured change events, and publishes them (usually to Kafka), where sink connectors land them in a warehouse or index. A real pipeline also takes a one-time initial snapshot of the table's current contents, then tails the log from the exact position where the snapshot ended, so nothing is missed or double-counted at the boundary. From there CDC feeds warehouse replication (the low-latency face of ELT), drives SCD Type 2 (the change stream is the input to the dimension-history MERGE), and powers event-driven systems (cache invalidation, search indexing) off the same committed changes.
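The snapshot-then-tail handoff can be sketched with a toy in-memory source. This is not Debezium's API. ToySource, its methods, and the integer log positions are hypothetical, and a real connector has to deal with concurrency and consistency that this model ignores. It only shows why recording the log position at snapshot time keeps the boundary clean.

```python
class ToySource:
    """Hypothetical source: a table plus an append-only change log."""

    def __init__(self):
        self.table = {}
        self.log = []  # entries are (position, op, key, row)

    def write(self, op, key, row=None):
        if op == "d":
            self.table.pop(key, None)
        else:
            self.table[key] = row
        self.log.append((len(self.log) + 1, op, key, row))

    def snapshot(self):
        """Copy the table and report the log position that copy reflects."""
        return dict(self.table), len(self.log)

    def read_log(self, after_position):
        """Return log entries strictly after the given position."""
        return [e for e in self.log if e[0] > after_position]


def apply(mirror, entry):
    _, op, key, row = entry
    if op == "d":
        mirror.pop(key, None)
    else:
        mirror[key] = row


source = ToySource()
source.write("c", 1, {"qty": 1})
source.write("c", 2, {"qty": 5})

# Step 1: one-time snapshot, remembering where in the log it ended.
mirror, position = source.snapshot()

# Writes keep arriving after the snapshot.
source.write("u", 1, {"qty": 2})
source.write("d", 2)
source.write("c", 3, {"qty": 9})

# Step 2: tail the log strictly after the snapshot position.
for entry in source.read_log(after_position=position):
    apply(mirror, entry)
    position = entry[0]  # advance the checkpoint as entries are applied

print(mirror == source.table, position)
```

Because the tail starts strictly after the recorded position, the two pre-snapshot writes are not applied a second time and the three later writes are not skipped.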

Practice

Quick check

Q1. Why is query-based CDC (polling WHERE updated_at > last_seen) fundamentally unable to capture deletes?

Q2. What does log-based CDC read, and why is it both complete and low-overhead?

Q3. TRANSFER: Your CDC consumer occasionally gets the same event twice after a network retry. What property prevents corruption, and how?

Questions about this lesson

What is Change Data Capture (CDC)?

CDC captures every individual change — insert, update, and delete — made to a source database and streams it onward the moment it commits, instead of re-copying whole tables on a schedule. Each change event carries the operation, the new (and usually old) row image, and a position in commit order, so a consumer can keep a warehouse or other system continuously in sync.

What is the difference between log-based and query-based CDC?

Query-based CDC polls the source with something like WHERE updated_at > last_seen — simple, but blind to deletes and to intermediate states between polls. Log-based CDC reads the database's transaction log (Postgres WAL, MySQL binlog), so it captures every insert, update, and delete in exact commit order with minimal load. Log-based is the robust choice; Debezium is the common tool.

Why can't query-based CDC capture deletes?

Query-based polling finds rows whose updated_at timestamp advanced. A DELETE removes the row entirely and leaves no timestamp behind, so the next poll has nothing to match and never sees the row disappear. Only reading the transaction log (or using soft-delete tombstones) captures deletions — which is why deleted rows linger as stale ghosts under polling.
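The soft-delete workaround mentioned above can be sketched as follows, again with sqlite3 and a hypothetical orders table. The is_deleted flag and the soft_delete helper are assumptions: the application marks a row as deleted and bumps updated_at instead of issuing a real DELETE, so the tombstone is visible to the next poll.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER, "
    "is_deleted INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO orders (id, status, updated_at) VALUES (1, 'new', 1)")
conn.execute("INSERT INTO orders (id, status, updated_at) VALUES (2, 'new', 2)")


def soft_delete(conn, order_id, now):
    """Mark the row as deleted and bump updated_at so a poller can see it."""
    conn.execute(
        "UPDATE orders SET is_deleted = 1, updated_at = ? WHERE id = ?",
        (now, order_id),
    )


def poll(conn, last_seen):
    return conn.execute(
        "SELECT id, status, updated_at, is_deleted FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()


# The replica is already in sync up to updated_at = 2.
replica = {1: "new", 2: "new"}
last_seen = 2

soft_delete(conn, 2, now=3)

changed = poll(conn, last_seen)
for id_, status, updated_at, is_deleted in changed:
    if is_deleted:
        replica.pop(id_, None)  # tombstone: drop the row downstream
    else:
        replica[id_] = status
    last_seen = updated_at

print(changed, replica)
```

This only works if every writer follows the soft-delete convention. A hard DELETE issued by any other path still vanishes without a trace, which is one reason log-based capture remains the more robust option.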
