Glossary · 4 min read

What Is CDC? Change Data Capture Explained

Written by 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


Change data capture (CDC) is a pattern that tracks and streams every row-level change in a source database — insert, update, delete — so downstream systems stay in sync without full table reloads. CDC is how modern pipelines ingest operational data from Postgres, MySQL, Oracle, and SQL Server into cloud warehouses in near real time.

CDC is the backbone of low-latency analytics and event-driven architectures. This guide walks through how CDC works, the two main approaches, and the tools that implement it in production stacks.

Before CDC became mainstream, the standard pattern for syncing operational databases to analytics was nightly full-table dumps — extract every row, truncate the destination, reload. This worked at small scale but broke at medium scale and was unthinkable at large scale. CDC emerged as the answer: capture only what changed, in near real time, and apply it incrementally downstream.

How CDC Works

CDC reads the source database's transaction log (WAL in Postgres, binlog in MySQL, redo log in Oracle) and converts each row change into an event. Each event contains the operation type, the before image, the after image, and a timestamp. Downstream consumers apply the events in order to keep their copies in sync.
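Concretely, a change event can be modeled as a small record, and consumers replay events in log order to converge on the source state. A minimal sketch in Python (the field names here are illustrative, not any particular tool's wire format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str                  # "insert", "update", or "delete"
    table: str
    key: int                 # primary key of the affected row
    before: Optional[dict]   # row image before the change (None for inserts)
    after: Optional[dict]    # row image after the change (None for deletes)
    ts: float                # commit timestamp

def apply_event(replica: dict, event: ChangeEvent) -> None:
    """Apply one row-level change to an in-memory replica of a table."""
    if event.op == "delete":
        replica.pop(event.key, None)
    else:  # insert or update: the after image is the new row state
        replica[event.key] = event.after

# Replaying events in log order keeps the replica in sync with the source.
events = [
    ChangeEvent("insert", "users", 1, None, {"name": "Ada"}, 1.0),
    ChangeEvent("update", "users", 1, {"name": "Ada"}, {"name": "Ada L."}, 2.0),
    ChangeEvent("delete", "users", 1, {"name": "Ada L."}, None, 3.0),
]
replica = {}
for e in events:
    apply_event(replica, e)
```

Order matters: applying the update before the insert, or the delete out of sequence, would leave the replica in a state the source never had.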

The transaction log is the internal ledger every relational database already keeps for durability. Reading from it is both efficient (no extra query load on the source) and complete (every change is captured, including deletes). The catch is that log formats are database-specific and often change between versions, so CDC tools embed per-database parsers that track each vendor's log format closely.

| Database   | Log Name           | CDC Method                    |
|------------|--------------------|-------------------------------|
| Postgres   | WAL                | Logical replication slots     |
| MySQL      | binlog             | Row-based binlog streaming    |
| Oracle     | redo / archive log | LogMiner / GoldenGate         |
| SQL Server | transaction log    | CDC tables or change tracking |
| MongoDB    | oplog              | Change streams                |
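Whatever the source, decoded log output typically arrives as structured change records. As an illustration, here is a sketch that parses a JSON payload shaped like the output of Postgres's wal2json logical decoding plugin; the payload is hand-written for the example, not captured from a real server:

```python
import json

# Hand-written example payload, shaped like wal2json output
# (one transaction containing one row-level change).
payload = json.dumps({
    "change": [
        {
            "kind": "update",
            "table": "accounts",
            "columnnames": ["id", "balance"],
            "columnvalues": [42, 120],
            "oldkeys": {"keynames": ["id"], "keyvalues": [42]},
        }
    ]
})

def parse_changes(raw: str) -> list[dict]:
    """Turn a decoded transaction payload into flat change dicts."""
    out = []
    for ch in json.loads(raw)["change"]:
        row = dict(zip(ch["columnnames"], ch["columnvalues"]))
        out.append({"op": ch["kind"], "table": ch["table"], "row": row})
    return out

changes = parse_changes(payload)
```

Per-database parsers inside CDC tools do essentially this, multiplied across vendors, log versions, and data types.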

Log-Based vs Query-Based CDC

Log-based CDC reads the transaction log directly — low latency, no impact on source, catches every change including deletes. Query-based CDC polls the source with SELECT WHERE updated_at > X — higher latency, impacts the source, misses hard deletes. Log-based is almost always the right choice when available.

Query-based CDC is easier to set up but has real limitations: no delete tracking, missed updates if two changes happen within one poll interval, and source load at every poll. Modern tools default to log-based wherever possible.
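The hard-delete blind spot is easy to demonstrate. A toy query-based poller against an in-memory SQLite table (the schema and watermark logic are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 100), (2, "new", 100)]
)

def poll_changes(conn, watermark: int) -> list[tuple]:
    """Query-based CDC: fetch rows modified since the last watermark."""
    return conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

# First poll captures both rows; advance the watermark to the max seen.
first = poll_changes(conn, 0)
watermark = max(r[2] for r in first)

# A hard delete leaves no row behind, so the next poll cannot see it.
conn.execute("DELETE FROM orders WHERE id = 2")
second = poll_changes(conn, watermark)  # the delete is silently missed
```

The deleted row simply never shows up in any subsequent poll, which is why query-based pipelines accumulate ghost rows downstream unless the source uses soft deletes.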

CDC Tools

  • Debezium — open source, Kafka-native, widely deployed
  • Fivetran — managed SaaS, wide source support
  • Airbyte — open source + managed, growing CDC support
  • AWS DMS — managed migration and CDC service
  • Stitch — acquired by Talend, SaaS CDC

CDC Use Cases

The most common CDC use case is near-real-time replication from operational databases to cloud warehouses — "mirror my Postgres into Snowflake with 30-second lag." CDC also powers event-driven architectures, search index updates, cache invalidation, and audit trails. Any system that needs to react to database changes benefits from CDC.

An increasingly common pattern is fan-out CDC: one source database streamed to several consumers in parallel. The warehouse gets the full stream for analytics, a search index gets the subset it needs, a cache gets invalidations, and an audit system gets a verbatim copy. The CDC pipeline becomes the central nervous system of the architecture, letting teams add new consumers without touching the source database.
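At its core, fan-out is a dispatch loop: every change event goes to every registered consumer, and each consumer decides what to do with it. A toy sketch (consumer names and filtering rules are invented for illustration):

```python
from typing import Callable

consumers: list[Callable[[dict], None]] = []

def register(fn):
    """Add a consumer; new consumers never touch the source database."""
    consumers.append(fn)
    return fn

warehouse_rows, search_docs = [], []
cache = {"users:1": "cached profile"}

@register
def to_warehouse(event):      # analytics gets the full stream
    warehouse_rows.append(event)

@register
def to_search(event):         # search only indexes the tables it needs
    if event["table"] == "users":
        search_docs.append(event["row"])

@register
def invalidate_cache(event):  # cache drops any key touched by a change
    cache.pop(f"{event['table']}:{event['row']['id']}", None)

def dispatch(event):
    for fn in consumers:
        fn(event)

dispatch({"table": "users", "op": "update", "row": {"id": 1, "name": "Ada"}})
dispatch({"table": "orders", "op": "insert", "row": {"id": 7}})
```

Adding a fourth consumer is one more `@register` function; the source database never knows or cares.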

CDC Challenges

CDC is harder than it looks. Schema changes in the source break downstream consumers. Bootstrap snapshots are expensive. Log retention policies affect recovery. Ordering across tables is tricky. Data engineers who implement CDC learn about all these edge cases the hard way — or use a managed tool that handles them.

Bootstrap snapshots deserve special attention. When you first connect a CDC tool to a source, you need a consistent snapshot of the current data before starting to stream changes. Taking that snapshot without blocking the source database or missing events is non-trivial, and every CDC tool handles it slightly differently. Test the bootstrap path carefully before relying on it in production — a failed bootstrap midway through a large table is one of the worst failure modes to recover from.
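The snapshot-then-stream handoff can be reasoned about with a position watermark: record the log position when the snapshot is taken, then skip any streamed change at or below that position so nothing is applied twice. A deliberately simplified sketch (real tools also handle locking, chunked snapshots, and writes that land mid-snapshot):

```python
def bootstrap(source_rows: dict, snapshot_position: int, stream: list) -> dict:
    """Phase 1: copy a consistent snapshot. Phase 2: apply only changes
    committed after the snapshot position, to avoid double-applying."""
    replica = dict(source_rows)            # phase 1: full snapshot copy
    for pos, key, value in stream:         # phase 2: streamed changes
        if pos <= snapshot_position:
            continue  # already reflected in the snapshot; skip it
        if value is None:
            replica.pop(key, None)         # delete
        else:
            replica[key] = value           # insert / update
    return replica

source = {1: "a", 2: "b"}
# Events are (log_position, key, value); positions 1-2 predate the snapshot.
stream = [(1, 1, "a"), (2, 2, "b"), (3, 2, "b2"), (4, 3, "c"), (5, 1, None)]
replica = bootstrap(source, snapshot_position=2, stream=stream)
```

Get the watermark wrong in either direction and you either replay changes already in the snapshot or drop changes that arrived during it, which is exactly why the bootstrap path deserves testing.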

For related topics see what is a data pipeline and how to handle schema evolution.

CDC in Modern Stacks

Most cloud-native stacks use CDC through a managed tool: Fivetran for enterprise workloads, Airbyte for open source, Debezium for Kafka-first architectures. Data Workers pipeline agents integrate with all three, monitoring replication lag, diagnosing failures, and handling schema evolution automatically.

Book a demo to see autonomous CDC management.

Real-World Examples

A SaaS company replicates its Postgres database into Snowflake using Fivetran's log-based CDC, with a target replication lag of 5 minutes. Analysts query near-real-time subscription state in the warehouse without touching production. A marketplace uses Debezium with Kafka to stream order events from MySQL into both a search index (Elasticsearch) and an analytics warehouse (BigQuery) simultaneously — one source, multiple consumers. A fintech uses Oracle GoldenGate for CDC from a legacy Oracle database into Databricks, meeting regulatory requirements for audit trails.

When You Need CDC

You need CDC whenever an operational database contains data that analytics or another system must see within minutes. Nightly full-table exports break down above a few gigabytes and cannot hit sub-hour latency. If your dashboards need to reflect today's activity, or your downstream systems need to react to changes quickly, CDC is the only practical pattern. When latency requirements are looser, simpler options like incremental exports keyed on an updated_at column often suffice.

Common Misconceptions

CDC is not only for streaming — many CDC pipelines run in micro-batches that feel like streaming but operate on 1-5 minute intervals. CDC also does not require Kafka; while Debezium uses Kafka, Fivetran and Airbyte implement CDC without any message broker involved. And CDC does not eliminate the need for bootstrap snapshots — every new consumer starts with a snapshot, then transitions to streaming changes. Plan for both phases.

Change data capture reads the source database's transaction log and streams row-level changes to downstream consumers. Log-based CDC is the modern default. Use a managed tool unless you have strong reasons to build your own, and automate the operational edge cases so CDC does not become your biggest ops burden.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
