What is Change Data Capture (CDC) and how is it implemented?
CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
How to think about it
CDC solves a core problem: production databases get updated in place, but your warehouse needs to reflect every change, including deletes and updates that leave no timestamp trail.
Why polling fails
A common naive approach polls the source table for rows where updated_at > last_run_time. This misses:
- Hard deletes — the row is gone; no
updated_atto query. - Updates with no timestamp — schema changes outside your control.
- High-frequency updates — polling every minute still means 60-second lag; log-based CDC can deliver sub-second.
Log-based CDC
Most production databases maintain a write-ahead log (WAL in PostgreSQL, binary log in MySQL, redo log in Oracle) for crash recovery. CDC tools like Debezium tail this log as a consumer.
Every committed transaction emits structured change events:
{
"op": "u",
"before": { "order_id": 123, "status": "pending" },
"after": { "order_id": 123, "status": "shipped" },
"ts_ms": 1749168000000,
"source": { "table": "orders", "lsn": 4029843200 }
}
Operations: c = create, u = update, d = delete, r = initial snapshot read.
Typical architecture
PostgreSQL WAL → Debezium → Kafka topic → Kafka Connect sink → Snowflake / Delta Lake
For simpler setups, Fivetran and Airbyte implement log-based CDC without managing Kafka. They expose it as a connector configuration option.
Handling CDC events in the warehouse
Downstream, you apply changes in order using a MERGE statement:
MERGE INTO orders AS target
USING cdc_orders_staging AS src
ON target.order_id = src.order_id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED AND src.op = 'u' THEN UPDATE SET status = src.after_status
WHEN NOT MATCHED AND src.op = 'c' THEN INSERT (order_id, status) VALUES (src.after_order_id, src.after_status);
PostgreSQL setup
-- Enable logical replication (requires postgres >= 10)
ALTER SYSTEM SET wal_level = 'logical';
-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');