Guide · 5 min read

Change Data Capture Explained: How CDC Keeps Warehouses in Sync


Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


Change data capture (CDC) is the pattern of reading a source database's transaction log and streaming every insert, update, and delete to downstream systems. It is the most efficient way to keep a warehouse, cache, or search index in sync with a production database — no batch queries, no table locks, sub-second latency when done right.

This guide explains what CDC actually is, the main implementation approaches, the popular tools, where modern lakehouses and AI agents change the tradeoffs, and the operational traps (schema drift, initial snapshot cost, deletes) that every team hits in their first six months of CDC in production.

What Is Change Data Capture?

CDC captures row-level changes from a source database the moment they happen. Instead of running SELECT * FROM orders every hour to find new rows, a CDC process reads the database's write-ahead log (MySQL binlog, Postgres WAL, SQL Server CDC tables, Oracle redo logs) and streams every committed change as an event. Downstream systems apply those events to stay in sync.

The win is efficiency and correctness. Batch extracts hammer the source, miss fast updates, and fall behind. CDC reads from the log stream without touching application tables, so it scales with write volume rather than table size. It also catches deletes and updates — things a naive 'where updated_at > X' query will miss because the deleted row is already gone by the time you look.
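To make the event stream concrete, here is a simplified sketch of what a log-based CDC event looks like, loosely modeled on Debezium's envelope (`op`, `before`, `after`); the real schema carries more metadata (source, schema version, transaction info), so treat the fields below as illustrative:

```python
import json

# Simplified Debezium-style change event (illustrative, not the full schema):
# "op" is c=create, u=update, d=delete; "before"/"after" hold the row state
# on each side of the change. Deletes carry only "before".
event = json.loads("""
{
  "op": "u",
  "ts_ms": 1717000000000,
  "before": {"id": 42, "status": "pending"},
  "after":  {"id": 42, "status": "shipped"}
}
""")

def describe(evt):
    """Turn a change event into a human-readable summary."""
    kind = {"c": "INSERT", "u": "UPDATE", "d": "DELETE"}[evt["op"]]
    row = evt["after"] or evt["before"]  # deletes have no "after"
    return f"{kind} id={row['id']}"

print(describe(event))  # UPDATE id=42
```

A downstream consumer applies these events in order; note that the update carries the full new row state, which is what makes warehouse-side upserts straightforward.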

CDC Implementation Approaches

| Approach | How It Works | Tradeoffs |
| --- | --- | --- |
| Log-based | Read WAL, binlog, or redo log | Best performance; requires DB privileges |
| Trigger-based | DB trigger writes to audit table | Works anywhere; adds write overhead |
| Query-based | Poll an updated_at column | Simple; misses deletes and fast updates |
| Timestamp-based | Compare snapshots | Simple; very stale and expensive |
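The blind spot of query-based CDC is easy to demonstrate. This minimal sketch (an in-memory dict standing in for a source table) polls by `updated_at` and shows why a delete leaves no trace:

```python
from datetime import datetime, timedelta

# In-memory stand-in for a source table. Query-based CDC polls
# "updated_at > last_seen" and therefore can never observe a delete.
now = datetime(2026, 1, 1, 12, 0)
table = {
    1: {"id": 1, "updated_at": now - timedelta(hours=2)},
    2: {"id": 2, "updated_at": now - timedelta(minutes=5)},
}

def poll_changes(table, last_seen):
    return [r for r in table.values() if r["updated_at"] > last_seen]

last_seen = now - timedelta(hours=1)
changed = poll_changes(table, last_seen)       # picks up row 2 only
del table[1]                                   # a hard delete in the source...
still_changed = poll_changes(table, last_seen)
# ...leaves no trace: the poll returns the same rows, so the
# warehouse keeps row 1 forever.
```

The same blindness applies to a row updated twice between polls: only the final state is visible, and the intermediate change is lost.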

Why Log-Based CDC Wins

Log-based is the gold standard. The database is already writing a transaction log for durability, so reading it is essentially free — no extra writes, no query load, no missed changes. Debezium, Fivetran, Airbyte, Estuary, and Datastream all use log-based CDC where available. Query-based CDC exists as a fallback for systems that expose no log access.

The tradeoff is that log-based CDC requires elevated database privileges — REPLICATION SLAVE on MySQL, a replication slot on Postgres, CDC enabled on SQL Server. Getting these permissions in regulated environments can take weeks. Once granted, though, the operational payoff is massive — you get every change with millisecond latency and near-zero overhead on the source.

Log-based CDC also naturally captures transaction boundaries, which matters for correctness. If a transaction updates five rows atomically, log-based CDC emits all five changes as a single consistent batch. Query-based CDC cannot guarantee this and can split transactions across polling cycles, which causes downstream consumers to observe inconsistent intermediate states.
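One way to picture the transaction-boundary guarantee: a log reader can buffer changes per transaction id and emit them only when the commit record arrives, so consumers never see a partial transaction. A minimal sketch (the record shapes are invented for illustration):

```python
from collections import defaultdict

# Buffer change records per transaction id; emit the whole batch only
# when the commit record arrives, so no consumer sees a partial txn.
buffers = defaultdict(list)
emitted = []

def on_log_record(record):
    if record["type"] == "change":
        buffers[record["txid"]].append(record["row"])
    elif record["type"] == "commit":
        emitted.append(buffers.pop(record["txid"], []))  # one atomic batch

log = [
    {"type": "change", "txid": 7, "row": {"id": 1}},
    {"type": "change", "txid": 7, "row": {"id": 2}},
    {"type": "commit", "txid": 7},
]
for rec in log:
    on_log_record(rec)
# emitted == [[{"id": 1}, {"id": 2}]] -- both rows arrive together
```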

The 2026 CDC landscape has three categories: open-source log readers like Debezium that you host yourself; managed connectors like Fivetran, Airbyte, and Estuary that handle the plumbing; and cloud-native services like AWS DMS, GCP Datastream, and Azure Data Factory.

See the CDC tools comparison for the full head-to-head, and Debezium vs Fivetran for the two options most teams compare first.

CDC Into a Lakehouse

The dominant 2026 pattern is streaming CDC events into an Iceberg or Delta table via an upsert or merge. Every change becomes an upsert, and the downstream lakehouse stays within seconds of production. Table formats support this efficiently via row-level deletes, and streaming engines like Flink or Spark Structured Streaming handle the plumbing.

  • Source — production Postgres / MySQL / Oracle
  • Capture — Debezium reads WAL, emits Kafka events
  • Land — Flink or Spark writes to Iceberg with merge
  • Serve — Trino/Snowflake/Databricks query the Iceberg table
  • Monitor — freshness SLAs alert if the lag grows
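The merge step above can be modeled in a few lines. A real pipeline would run `MERGE INTO` against Iceberg from Flink or Spark; this sketch just demonstrates the semantics the merge must implement, using a dict keyed by primary key as the target table:

```python
# Merge semantics for CDC events: upsert on insert/update, row-level
# delete on delete. The keyed dict models the Iceberg target table.
def merge(target, events, key="id"):
    for evt in events:
        if evt["op"] == "d":
            target.pop(evt["before"][key], None)   # row-level delete
        else:
            row = evt["after"]
            target[row[key]] = row                 # upsert
    return target

target = {1: {"id": 1, "status": "pending"}}
events = [
    {"op": "u", "after": {"id": 1, "status": "shipped"}},
    {"op": "c", "after": {"id": 2, "status": "pending"}},
    {"op": "d", "before": {"id": 1}},
]
merge(target, events)
# target now holds only row 2: the update landed, then row 1 was deleted
```

Event order matters here: applying the delete before the update would resurrect row 1, which is why ordering guarantees (covered below) are not optional.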

Common Pitfalls

  • Schema drift — the #1 CDC headache: a new column added to production must propagate without breaking the pipeline.
  • Initial snapshots — take a long time on large tables and can saturate the source.
  • Deletes — need explicit handling downstream because most warehouses default to append-only.
  • Ordering — guarantees matter when rows can be updated multiple times within the same batch.

Replication slot bloat is another silent killer on Postgres — if your CDC consumer falls behind, the WAL grows indefinitely and can fill the source disk. Always monitor replication lag and alert before it becomes a source outage.
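The lag computation itself is simple. Postgres reports WAL positions as LSNs in `high/low` hex form; lag in bytes is the difference between the current write position and the slot's restart position. In production you would read these from `pg_current_wal_lsn()` and `pg_replication_slots`; the parsing and threshold logic is the same either way (the 5 GB threshold below is an arbitrary example):

```python
# Postgres LSNs look like "16/B374D848": a 32-bit high part and a
# 32-bit low part, both hex. Lag in bytes = current LSN - slot LSN.
def lsn_to_bytes(lsn: str) -> int:
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

ALERT_THRESHOLD = 5 * 1024**3  # example: page someone well before disk fills

lag = slot_lag_bytes("16/B374D848", "16/B0000000")  # ~58 MB behind
should_alert = lag > ALERT_THRESHOLD
```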

Initial Snapshot Strategy

Before you can stream changes, you need an initial snapshot of the current state. For small tables this is instant; for billion-row tables it can take hours and put real load on the source. Debezium supports incremental snapshotting to parallelize this; Fivetran and Airbyte handle it transparently. Plan the initial snapshot for off-peak hours and coordinate with the DBA.

Incremental snapshotting is the 2024+ pattern that avoids locking the source for hours. Instead of SELECT * FROM table, the CDC tool scans in chunks with watermarks, interleaving with live change events. Debezium's signal-based incremental snapshot and Fivetran's native approach both work this way. It is the difference between a feasible and infeasible CDC rollout on large tables.
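The chunking idea can be sketched as a generator. A real implementation (like Debezium's signal-based snapshot) issues a bounded query per chunk, e.g. `WHERE id > ? ORDER BY id LIMIT ?`, and interleaves live change events between chunks; here a sorted key list stands in for the table:

```python
# Incremental-snapshot sketch: scan the table in primary-key chunks
# instead of one giant SELECT. Live change events can be interleaved
# between chunks, and each chunk boundary acts as a watermark.
def snapshot_chunks(keys, chunk_size):
    keys = sorted(keys)
    for i in range(0, len(keys), chunk_size):
        yield keys[i : i + chunk_size]  # one bounded query's worth of rows

chunks = list(snapshot_chunks(range(1, 11), chunk_size=4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```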

Deleted Row Handling

Most analytics warehouses are append-only or upsert-friendly, not delete-aware. When CDC surfaces a delete, you need to decide how to represent it downstream. Soft deletes — add an is_deleted flag and keep the row — are the safest because you do not lose history. Hard deletes — remove the row from the target — match source state exactly but destroy history. Pick one per table based on whether the business cares about 'what used to exist'.
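Soft delete is a one-line change to the apply step: instead of removing the row, flag it. A minimal sketch of that choice, reusing the keyed-dict model of a target table:

```python
# Soft-delete sketch: keep the row, mark it gone. History survives, and
# current-state views just filter on is_deleted.
def apply_delete_soft(target, delete_event, key="id"):
    row = target.get(delete_event["before"][key])
    if row is not None:
        row["is_deleted"] = True  # row stays queryable for history

target = {7: {"id": 7, "email": "a@example.com", "is_deleted": False}}
apply_delete_soft(target, {"op": "d", "before": {"id": 7}})
# row 7 is still present, just flagged
```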

The GDPR wrinkle is that some deletes are legally required to actually delete. If a user invokes right-to-be-forgotten, the downstream warehouse must hard-delete their PII even if the default strategy is soft-delete. Build the erasure workflow into your CDC pipeline explicitly — do not assume soft-delete is the right answer for every table.

Agent-Managed CDC

Data Workers' pipeline agent detects schema drift in CDC streams, evolves the downstream tables automatically, and escalates when a change requires human approval. See autonomous data engineering or book a demo.

Change data capture is how you keep a warehouse in sync with production without beating up the source. Log-based is the standard, Iceberg is the landing format, and agents are the new operational layer — combine them and you get sub-second freshness without manual babysitting.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
