What Is CDC? Change Data Capture Explained
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
Change data capture (CDC) is a pattern that tracks and streams every row-level change in a source database — insert, update, delete — so downstream systems stay in sync without full table reloads. CDC is how modern pipelines ingest operational data from Postgres, MySQL, Oracle, and SQL Server into cloud warehouses in near real time.
CDC is the backbone of low-latency analytics and event-driven architectures. This guide walks through how CDC works, the two main approaches, and the tools that implement it in production stacks.
Before CDC became mainstream, the standard pattern for syncing operational databases to analytics was nightly full-table dumps — extract every row, truncate the destination, reload. This worked at small scale but broke at medium scale and was unthinkable at large scale. CDC emerged as the answer: capture only what changed, in near real time, and apply it incrementally downstream.
How CDC Works
CDC reads the source database's transaction log (WAL in Postgres, binlog in MySQL, redo log in Oracle) and converts each row change into an event. Each event contains the operation type, the before image, the after image, and a timestamp. Downstream consumers apply the events in order to keep their copies in sync.
The transaction log is the internal ledger every relational database already keeps for durability. Reading from it is both efficient (no extra query load on the source) and complete (every change is captured, including deletes). The catch is that log formats are database-specific and often change between versions, so CDC tools embed per-database parsers that track each vendor's log format closely.
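The event envelope described above can be sketched as a small Python structure. The field names here are illustrative rather than any tool's exact schema (Debezium, for instance, uses `op`, `before`, `after`, and `ts_ms` in its envelope), but the shape is the same: operation type, before image, after image, timestamp.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """One row-level change decoded from the transaction log."""
    op: str                 # "insert" | "update" | "delete"
    table: str
    before: Optional[dict]  # row image before the change (None for inserts)
    after: Optional[dict]   # row image after the change (None for deletes)
    ts_ms: int              # commit timestamp in milliseconds

def apply_event(replica: dict, event: ChangeEvent) -> None:
    """Apply one event to an in-memory replica keyed by primary key."""
    if event.op == "delete":
        replica.pop(event.before["id"], None)
    else:  # insert or update carry the full new row image
        replica[event.after["id"]] = event.after

# Replaying events in commit order keeps the replica in sync.
events = [
    ChangeEvent("insert", "users", None, {"id": 1, "name": "Ada"}, 1000),
    ChangeEvent("update", "users", {"id": 1, "name": "Ada"},
                {"id": 1, "name": "Ada L."}, 2000),
    ChangeEvent("delete", "users", {"id": 1, "name": "Ada L."}, None, 3000),
]
replica = {}
for e in events:
    apply_event(replica, e)
print(replica)  # the final delete leaves the replica empty: {}
```

Applying events in commit order is what makes the replica converge on the source; reordering them would apply a stale image over a newer one.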
| Database | Log Name | CDC Method |
|---|---|---|
| Postgres | WAL | Logical replication slots |
| MySQL | binlog | Row-based binlog streaming |
| Oracle | redo / archive log | LogMiner / GoldenGate |
| SQL Server | transaction log | CDC tables or change tracking |
| MongoDB | oplog | Change streams |
Log-Based vs Query-Based CDC
Log-based CDC reads the transaction log directly — low latency, no query load on the source, and it catches every change including deletes. Query-based CDC polls the source with queries like `SELECT * FROM t WHERE updated_at > :last_poll` — higher latency, load on the source at every poll, and it misses hard deletes. Log-based is almost always the right choice when available.
Query-based CDC is easier to set up but has real limitations: no delete tracking, lost intermediate states when a row changes more than once within a poll interval, and query load on the source at every poll. Modern tools default to log-based wherever possible.
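The hard-delete blind spot is easy to demonstrate. This toy sketch uses an in-memory dict as the "source table"; the `poll` function is the moral equivalent of the `updated_at` query a real poller would run.

```python
def poll(source: dict, watermark: int) -> list[dict]:
    """Query-based CDC: fetch rows whose updated_at advanced past the
    watermark (equivalent to SELECT * FROM t WHERE updated_at > :watermark)."""
    return [row for row in source.values() if row["updated_at"] > watermark]

source = {
    1: {"id": 1, "name": "Ada",  "updated_at": 10},
    2: {"id": 2, "name": "Carl", "updated_at": 10},
}
replica, watermark = {}, 0

# Poll 1: both rows are newer than the watermark; the replica catches up.
for row in poll(source, watermark):
    replica[row["id"]] = row
watermark = 10

# Between polls: row 2 is hard-deleted at the source.
del source[2]

# Poll 2: the delete produced no row with a newer updated_at,
# so the poller sees nothing and the replica keeps the ghost row.
for row in poll(source, watermark):
    replica[row["id"]] = row

print(2 in replica)  # True — the deleted row survives in the replica
```

A log-based reader would have seen the delete as an explicit event; the polling query structurally cannot, because the row it would need to select no longer exists.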
CDC Tools
- Debezium — open source, Kafka-native, widely deployed
- Fivetran — managed SaaS, wide source support
- Airbyte — open source + managed, growing CDC support
- AWS DMS — managed migration and CDC service
- Stitch — acquired by Talend, SaaS CDC
CDC Use Cases
The most common CDC use case is near-real-time replication from operational databases to cloud warehouses — "mirror my Postgres into Snowflake with 30-second lag." CDC also powers event-driven architectures, search index updates, cache invalidation, and audit trails. Any system that needs to react to database changes benefits from CDC.
An increasingly common pattern is fan-out CDC: one source database streamed to several consumers in parallel. The warehouse gets the full stream for analytics, a search index gets the subset it needs, a cache gets invalidations, and an audit system gets a verbatim copy. The CDC pipeline becomes the central nervous system of the architecture, letting teams add new consumers without touching the source database.
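Stripped of infrastructure, fan-out is just event routing: each consumer declares which slice of the stream it wants, and one pass over the stream feeds them all. The consumer names below are illustrative, not any tool's API.

```python
# Events are plain dicts: {"op": ..., "table": ..., "key": ...}.

def warehouse(event, state):      # analytics wants the full stream
    state.append(event)

def search_index(event, state):   # only the tables it indexes
    if event["table"] == "products":
        state.append(event)

def cache(event, state):          # only invalidation signals
    if event["op"] in ("update", "delete"):
        state.append(event["key"])

consumers = [(warehouse, []), (search_index, []), (cache, [])]

stream = [
    {"op": "insert", "table": "products", "key": "p1"},
    {"op": "update", "table": "orders",   "key": "o9"},
    {"op": "delete", "table": "products", "key": "p1"},
]

# One pass feeds every consumer; adding a consumer never touches the source.
for event in stream:
    for handler, state in consumers:
        handler(event, state)

print(len(consumers[0][1]), len(consumers[1][1]), consumers[2][1])
# 3 2 ['o9', 'p1']
```

This is why the pattern scales: a new consumer is a new handler reading the same stream, with zero additional load on the operational database.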
CDC Challenges
CDC is harder than it looks. Schema changes in the source break downstream consumers. Bootstrap snapshots are expensive. Log retention policies affect recovery. Ordering across tables is tricky. Data engineers who implement CDC learn about all these edge cases the hard way — or use a managed tool that handles them.
Bootstrap snapshots deserve special attention. When you first connect a CDC tool to a source, you need a consistent snapshot of the current data before starting to stream changes. Taking that snapshot without blocking the source database or missing events is non-trivial, and every CDC tool handles it slightly differently. Test the bootstrap path carefully before relying on it in production — a failed bootstrap midway through a large table is one of the worst failure modes to recover from.
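One common approach to the snapshot handoff (used, with many refinements, by tools like Debezium) is to record a log position before snapshotting, then discard streamed events at or below that position so nothing is double-applied. A minimal sketch, with log positions modeled as plain integers:

```python
def bootstrap(snapshot_rows: dict, snapshot_lsn: int, buffered_events: list) -> dict:
    """Snapshot-then-stream handoff.

    snapshot_rows:   consistent snapshot of the table, taken at snapshot_lsn
    buffered_events: log events captured while the snapshot was running
    """
    replica = dict(snapshot_rows)  # phase 1: bulk-load the snapshot
    # Phase 2: replay only events the snapshot has not already seen.
    for event in buffered_events:
        if event["lsn"] <= snapshot_lsn:
            continue  # already reflected in the snapshot; skip to avoid double-applying
        if event["op"] == "delete":
            replica.pop(event["key"], None)
        else:
            replica[event["key"]] = event["row"]
    return replica

snapshot = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
events = [
    {"lsn": 100, "op": "update", "key": 1, "row": {"id": 1, "v": "a"}},   # pre-snapshot
    {"lsn": 101, "op": "update", "key": 1, "row": {"id": 1, "v": "a2"}},  # post-snapshot
    {"lsn": 102, "op": "delete", "key": 2},                               # post-snapshot
]
replica = bootstrap(snapshot, snapshot_lsn=100, buffered_events=events)
print(replica)  # {1: {'id': 1, 'v': 'a2'}}
```

The hard production problems — obtaining the snapshot without long locks, resuming a snapshot that died mid-table, retaining enough log while the snapshot runs — are exactly what the managed tools differ on, which is why the bootstrap path deserves testing before you depend on it.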
For related topics see *What Is a Data Pipeline* and *How to Handle Schema Evolution*.
CDC in Modern Stacks
Most cloud-native stacks use CDC through a managed tool: Fivetran for enterprise workloads, Airbyte for open source, Debezium for Kafka-first architectures. Data Workers pipeline agents integrate with all three, monitoring replication lag, diagnosing failures, and handling schema evolution automatically.
Book a demo to see autonomous CDC management.
Real-World Examples
A SaaS company replicates its Postgres database into Snowflake using Fivetran's log-based CDC, with a target replication lag of 5 minutes. Analysts query near-real-time subscription state in the warehouse without touching production. A marketplace uses Debezium with Kafka to stream order events from MySQL into both a search index (Elasticsearch) and an analytics warehouse (BigQuery) simultaneously — one source, multiple consumers. A fintech uses Oracle GoldenGate for CDC from a legacy Oracle database into Databricks, meeting regulatory requirements for audit trails.
When You Need CDC
You need CDC whenever an operational database contains data that analytics or another system must see within minutes. Nightly full-table exports break down above a few gigabytes and cannot hit sub-hour latency. If your dashboards need to reflect today's activity, or your downstream systems need to react to changes quickly, CDC is the only practical pattern. Below that latency tier, simpler options like incremental exports based on `updated_at` columns often suffice.
Common Misconceptions
CDC is not only for streaming — many CDC pipelines run in micro-batches that feel like streaming but operate on 1-5 minute intervals. CDC also does not require Kafka; while Debezium uses Kafka, Fivetran and Airbyte implement CDC without any message broker involved. And CDC does not eliminate the need for bootstrap snapshots — every new consumer starts with a snapshot, then transitions to streaming changes. Plan for both phases.
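The micro-batch point can be illustrated by bucketing a change stream into fixed time windows; the interval is a tuning knob, and 1-5 minutes is the typical range mentioned above. Timestamps here are plain seconds for simplicity.

```python
from collections import defaultdict

def micro_batches(events: list, interval_s: int) -> dict:
    """Group events into fixed time windows keyed by window start."""
    batches = defaultdict(list)
    for e in events:
        window_start = (e["ts"] // interval_s) * interval_s
        batches[window_start].append(e)
    return dict(batches)

# With a 60-second interval, three events land in two batches:
# window 0 holds ts 5 and ts 42; window 60 holds ts 75.
events = [{"ts": 5, "key": "a"}, {"ts": 42, "key": "b"}, {"ts": 75, "key": "c"}]
print(micro_batches(events, interval_s=60))
```

Each window is then applied downstream as one transaction, which is why micro-batch CDC feels like streaming to consumers while keeping the simpler operational model of batch loads.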
Change data capture reads the source database's transaction log and streams row-level changes to downstream consumers. Log-based CDC is the modern default. Use a managed tool unless you have strong reasons to build your own, and automate the operational edge cases so CDC does not become your biggest ops burden.