datarekha
Data Engineering Medium Asked at DebeziumAsked at ConfluentAsked at FivetranAsked at AirbyteAsked at Amazon

What is Change Data Capture (CDC) and how is it implemented?

The short answer

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.

How to think about it

CDC solves a core problem: production databases get updated in place, but your warehouse needs to reflect every change, including deletes and updates that leave no timestamp trail.

Why polling fails

A common naive approach polls the source table for rows where updated_at > last_run_time. This misses:

  • Hard deletes — the row is gone; no updated_at to query.
  • Updates with no timestamp — schema changes outside your control.
  • High-frequency updates — polling every minute still means 60-second lag; log-based CDC can deliver sub-second.

Log-based CDC

Most production databases maintain a write-ahead log (WAL in PostgreSQL, binary log in MySQL, redo log in Oracle) for crash recovery. CDC tools like Debezium tail this log as a consumer.

Every committed transaction emits structured change events:

{
  "op": "u",
  "before": { "order_id": 123, "status": "pending" },
  "after":  { "order_id": 123, "status": "shipped" },
  "ts_ms": 1749168000000,
  "source": { "table": "orders", "lsn": 4029843200 }
}

Operations: c = create, u = update, d = delete, r = initial snapshot read.

Typical architecture

PostgreSQL WAL → Debezium → Kafka topic → Kafka Connect sink → Snowflake / Delta Lake

For simpler setups, Fivetran and Airbyte implement log-based CDC without managing Kafka. They expose it as a connector configuration option.

Handling CDC events in the warehouse

Downstream, you apply changes in order using a MERGE statement:

MERGE INTO orders AS target
USING cdc_orders_staging AS src
  ON target.order_id = src.order_id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED AND src.op = 'u' THEN UPDATE SET status = src.after_status
WHEN NOT MATCHED AND src.op = 'c' THEN INSERT (order_id, status) VALUES (src.after_order_id, src.after_status);

PostgreSQL setup

-- Enable logical replication (requires postgres >= 10)
ALTER SYSTEM SET wal_level = 'logical';
-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');
Learn it properly Change Data Capture (CDC)

Keep practising

All Data Engineering questions

Explore further

Skip to content