Change Data Capture Explained: How CDC Keeps Warehouses in Sync
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
Change data capture (CDC) is the pattern of reading a source database's transaction log and streaming every insert, update, and delete to downstream systems. It is the most efficient way to keep a warehouse, cache, or search index in sync with a production database — no batch queries, no table locks, sub-second latency when done right.
This guide explains what CDC actually is, the main implementation approaches, the popular tools, where modern lakehouses and AI agents change the tradeoffs, and the operational traps (schema drift, initial snapshot cost, deletes) that every team hits in their first six months of CDC in production.
What Is Change Data Capture?
CDC captures row-level changes from a source database the moment they happen. Instead of running SELECT * FROM orders every hour to find new rows, a CDC process reads the database's write-ahead log (MySQL binlog, Postgres WAL, SQL Server CDC tables, Oracle redo logs) and streams every committed change as an event. Downstream systems apply those events to stay in sync.
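To make the event stream concrete, here is a simplified sketch of the envelope a log reader like Debezium emits per change. The field names (`before`, `after`, `op`, `ts_ms`) follow Debezium's convention; the table, columns, and values are illustrative:

```python
# A simplified Debezium-style change event for an UPDATE on an orders table.
# before/after carry the row image; op encodes the change type.
event = {
    "before": {"id": 42, "status": "pending", "total": 99.50},
    "after":  {"id": 42, "status": "shipped", "total": 99.50},
    "source": {"db": "shop", "table": "orders", "lsn": 123456789},
    "op": "u",            # c = insert, u = update, d = delete
    "ts_ms": 1735689600000,
}

def describe(evt):
    """Human-readable summary of a change event."""
    ops = {"c": "insert", "u": "update", "d": "delete"}
    row = evt["after"] or evt["before"]   # deletes only carry a before image
    return f"{ops[evt['op']]} on {evt['source']['table']} id={row['id']}"

print(describe(event))  # update on orders id=42
```

Downstream consumers switch on `op` to decide whether to insert, update, or delete in the target.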
The win is efficiency and correctness. Batch extracts hammer the source, miss fast updates, and fall behind. CDC reads from the log stream without touching application tables, so it scales with write volume rather than table size. It also catches deletes and updates — things a naive 'where updated_at > X' query will miss because the deleted row is already gone by the time you look.
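The delete problem is easy to demonstrate. This minimal sketch (SQLite standing in for the source, schema illustrative) shows why a poll on `updated_at` can never see a delete: by the time the query runs, the row has left no trace.

```python
import sqlite3

# Query-based CDC misses deletes: once the row is gone,
# no WHERE clause on the table can return it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at INTEGER)")
db.execute("INSERT INTO orders VALUES (1, 100), (2, 100)")

last_seen = 100
db.execute("DELETE FROM orders WHERE id = 2")   # a delete happens...

# ...and the next poll sees nothing new.
changes = db.execute(
    "SELECT id FROM orders WHERE updated_at > ?", (last_seen,)
).fetchall()
print(changes)  # [] -- the delete of id=2 is invisible to the poller
```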
CDC Implementation Approaches
| Approach | How It Works | Tradeoffs |
|---|---|---|
| Log-based | Read WAL, binlog, or redo log | Best performance, requires DB privileges |
| Trigger-based | DB trigger writes to audit table | Works anywhere, adds write overhead |
| Query-based | Poll updated_at column | Simple, misses deletes and fast updates |
| Timestamp-based | Compare snapshots | Simple, very stale, expensive |
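The trigger-based row in the table above can be sketched end to end. Here SQLite triggers copy every insert, update, and delete into an audit table a pipeline could tail; the table names and schema are illustrative, and the extra write per change is exactly the overhead the table warns about:

```python
import sqlite3

# Trigger-based CDC: triggers mirror every change into an audit table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE orders_audit (op TEXT, id INTEGER, status TEXT);

CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN INSERT INTO orders_audit VALUES ('I', NEW.id, NEW.status); END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders
BEGIN INSERT INTO orders_audit VALUES ('U', NEW.id, NEW.status); END;

CREATE TRIGGER orders_del AFTER DELETE ON orders
BEGIN INSERT INTO orders_audit VALUES ('D', OLD.id, OLD.status); END;
""")

db.execute("INSERT INTO orders VALUES (1, 'pending')")
db.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
db.execute("DELETE FROM orders WHERE id = 1")

audit = db.execute("SELECT op, id, status FROM orders_audit").fetchall()
print(audit)  # [('I', 1, 'pending'), ('U', 1, 'shipped'), ('D', 1, 'shipped')]
```

Unlike the polling approach, the delete shows up in the audit trail.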
Why Log-Based CDC Wins
Log-based is the gold standard. The database is already writing a transaction log for durability, so reading it is essentially free — no extra writes, no query load, no missed changes. Debezium, Fivetran, Airbyte, Estuary, and Datastream all use log-based CDC where available. Query-based CDC exists as a fallback for systems that expose no log access.
The tradeoff is that log-based CDC requires elevated database privileges — REPLICATION SLAVE on MySQL, a replication slot on Postgres, CDC enabled on SQL Server. Getting these permissions in regulated environments can take weeks. Once granted, though, the operational payoff is massive — you get every change with millisecond latency and near-zero overhead on the source.
Log-based CDC also naturally captures transaction boundaries, which matters for correctness. If a transaction updates five rows atomically, log-based CDC emits all five changes as a single consistent batch. Query-based CDC cannot guarantee this and can split transactions across polling cycles, which causes downstream consumers to observe inconsistent intermediate states.
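A consumer can preserve those boundaries by buffering events until the transaction id changes and applying each buffered batch atomically. This is a pure-Python sketch; the event shape and `xid` field are illustrative stand-ins for the transaction metadata log-based tools expose:

```python
# Apply CDC events only at transaction boundaries, so a reader of
# `table` never observes half of a transaction (e.g. a transfer that
# debits one account without crediting the other).
events = [
    {"xid": 7, "op": "u", "row": {"id": 1, "balance": -50}},
    {"xid": 7, "op": "u", "row": {"id": 2, "balance": 50}},
    {"xid": 8, "op": "c", "row": {"id": 3, "balance": 0}},
]

def apply_by_transaction(events, table):
    """Buffer events until the xid changes, then apply the batch atomically."""
    batch, current = [], None

    def flush():
        for e in batch:
            table[e["row"]["id"]] = e["row"]
        batch.clear()

    for e in events:
        if current is not None and e["xid"] != current:
            flush()                      # commit the previous transaction
        current = e["xid"]
        batch.append(e)
    flush()                              # commit the final transaction

table = {}
apply_by_transaction(events, table)
print(sorted(table))  # [1, 2, 3]
```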
Popular CDC Tools
The 2026 CDC landscape falls into three categories: open-source log readers like Debezium that you host yourself, managed connectors like Fivetran, Airbyte, and Estuary that handle the plumbing, and cloud-native services like AWS DMS, GCP Datastream, and Azure Data Factory.
See the CDC tools comparison for the full head-to-head, and Debezium vs Fivetran for the two options most teams compare first.
CDC Into a Lakehouse
The dominant 2026 pattern is streaming CDC events into an Iceberg or Delta table via an upsert or merge. Every change becomes an upsert, and the downstream lakehouse stays within seconds of production. Table formats support this efficiently via row-level deletes, and streaming engines like Flink or Spark Structured Streaming handle the plumbing.
- Source — production Postgres / MySQL / Oracle
- Capture — Debezium reads WAL, emits Kafka events
- Land — Flink or Spark writes to Iceberg with merge
- Serve — Trino/Snowflake/Databricks query the Iceberg table
- Monitor — freshness SLAs alert if the lag grows
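The "Land" step above boils down to a merge loop. This is a minimal sketch with a Python dict standing in for the Iceberg table; real pipelines express the same logic as a Flink or Spark `MERGE INTO` backed by row-level deletes:

```python
# Apply a CDC stream as an upsert/merge against a keyed table.
def merge(table, event):
    key = event["key"]
    if event["op"] == "d":
        table.pop(key, None)          # delete -> row-level delete
    else:
        table[key] = event["row"]     # insert or update -> upsert

table = {}
stream = [
    {"op": "c", "key": 1, "row": {"id": 1, "status": "pending"}},
    {"op": "u", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "c", "key": 2, "row": {"id": 2, "status": "pending"}},
    {"op": "d", "key": 2, "row": None},
]
for event in stream:
    merge(table, event)

print(table)  # {1: {'id': 1, 'status': 'shipped'}}
```

After the stream is applied, the table matches source state: row 1 reflects its latest update and row 2 is gone.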
Common Pitfalls
Schema drift is the #1 CDC headache — a new column added to production must propagate without breaking the pipeline. Initial snapshots take a long time on large tables and can saturate the source. Deletes need explicit handling downstream because most warehouses default to append-only. Ordering guarantees matter when rows can be updated multiple times within the same batch.
Replication slot bloat is another silent killer on Postgres — if your CDC consumer falls behind, the WAL grows indefinitely and can fill the source disk. Always monitor replication lag and alert before it becomes a source outage.
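The alerting logic itself is simple once you have the numbers. This sketch assumes you have already fetched the current WAL position and the slot's restart position (on Postgres, from `pg_current_wal_lsn()` and `pg_replication_slots`); the 50 GiB threshold is an illustrative budget, not a recommendation:

```python
# Alert when a replication slot is retaining too much WAL,
# well before the source disk fills.
ALERT_BYTES = 50 * 1024**3   # illustrative: alert at 50 GiB retained

def slot_status(current_lsn: int, restart_lsn: int) -> str:
    """Compare retained WAL bytes against the alert budget."""
    retained = current_lsn - restart_lsn
    gib = retained / 1024**3
    if retained >= ALERT_BYTES:
        return f"ALERT: slot retaining {gib:.1f} GiB of WAL"
    return f"ok: {gib:.1f} GiB retained"

# A consumer 60 GiB behind trips the alert.
print(slot_status(current_lsn=120 * 1024**3, restart_lsn=60 * 1024**3))
```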
Initial Snapshot Strategy
Before you can stream changes, you need an initial snapshot of the current state. For small tables this is instant; for billion-row tables it can take hours and put real load on the source. Debezium supports incremental snapshotting to parallelize this; Fivetran and Airbyte handle it transparently. Plan the initial snapshot for off-peak hours and coordinate with the DBA.
Incremental snapshotting is the 2024+ pattern that avoids locking the source for hours. Instead of SELECT * FROM table, the CDC tool scans in chunks with watermarks, interleaving with live change events. Debezium's signal-based incremental snapshot and Fivetran's native approach both work this way. It is the difference between a feasible and infeasible CDC rollout on large tables.
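The chunk-and-watermark scan can be sketched in a few lines. This is purely illustrative (an in-memory list stands in for the table, and the watermark is the last primary key seen); the point is that each pass reads a small bounded range, so the source is never locked for long and live change events can be interleaved between chunks:

```python
# Incremental snapshot sketch: scan a keyed table in small chunks,
# advancing a primary-key watermark between chunks.
def incremental_snapshot(rows, chunk_size=2):
    """Yield chunks of rows ordered by id, tracking the last key seen."""
    watermark = None
    while True:
        chunk = [r for r in rows
                 if watermark is None or r["id"] > watermark][:chunk_size]
        if not chunk:
            break
        watermark = chunk[-1]["id"]
        yield chunk        # between chunks, live CDC events can be applied

rows = [{"id": i} for i in range(1, 6)]
chunks = list(incremental_snapshot(rows))
print([[r["id"] for r in c] for c in chunks])  # [[1, 2], [3, 4], [5]]
```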
Deleted Row Handling
Most analytics warehouses are append-only or upsert-friendly, not delete-aware. When CDC surfaces a delete, you need to decide how to represent it downstream. Soft deletes — add an is_deleted flag and keep the row — are the safest because you do not lose history. Hard deletes — remove the row from the target — match source state exactly but destroy history. Pick one per table based on whether the business cares about 'what used to exist'.
The GDPR wrinkle is that some deletes are legally required to actually delete. If a user invokes right-to-be-forgotten, the downstream warehouse must hard-delete their PII even if the default strategy is soft-delete. Build the erasure workflow into your CDC pipeline explicitly — do not assume soft-delete is the right answer for every table.
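Both policies can live in one delete handler: soft-delete by default, hard-delete when the key is covered by an erasure request. This is a sketch with illustrative table and field names, not a complete erasure workflow:

```python
# Delete handling sketch: soft-delete preserves history, except for
# keys under a right-to-be-forgotten request, which are hard-deleted.
erasure_requests = {2}   # user ids with pending erasure requests

def handle_delete(table, key):
    if key in erasure_requests:
        table.pop(key, None)              # GDPR: actually remove the PII
    elif key in table:
        table[key]["is_deleted"] = True   # default: flag, keep the row

table = {
    1: {"email": "a@example.com", "is_deleted": False},
    2: {"email": "b@example.com", "is_deleted": False},
}
handle_delete(table, 1)   # normal delete -> flagged
handle_delete(table, 2)   # erasure request -> removed
print(table)  # {1: {'email': 'a@example.com', 'is_deleted': True}}
```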
Agent-Managed CDC
Data Workers' pipeline agent detects schema drift in CDC streams, evolves the downstream tables automatically, and escalates when a change requires human approval. See autonomous data engineering or book a demo.
Change data capture is how you keep a warehouse in sync with production without beating up the source. Log-based is the standard, Iceberg is the landing format, and agents are the new operational layer — combine them and you get sub-second freshness without manual babysitting.
Related Resources
- Why AI Agents Need MCP Servers for Data Engineering — MCP servers give AI agents structured access to your data tools — Snowflake, BigQuery, dbt, Airflow, and more. Here is why MCP is the int…
- The Complete Guide to Agentic Data Engineering with MCP — Agentic data engineering replaces manual pipeline management with autonomous AI agents. Here is how to implement it with MCP — without lo…
- RBAC for Data Engineering Teams: Why Manual Access Control Doesn't Scale — Manual RBAC breaks down at 50+ data assets. Policy drift, orphaned permissions, and PII exposure become inevitable. AI agents enforce gov…
- From Alert to Resolution in Minutes: How AI Agents Debug Data Pipeline Incidents — The average data pipeline incident takes 4-8 hours to resolve. AI agents that understand your full data context can auto-diagnose and res…
- Build Data Pipelines with AI: From Description to Deployment in Minutes — Building a data pipeline still takes 2-6 weeks of engineering time. AI agents that understand your data context can generate, test, and d…
- Why Your Data Catalog Is Always Out of Date (And How AI Agents Fix It) — 40-60% of data catalog entries are outdated at any given time. AI agents that continuously scan, classify, and update metadata make the s…
- Data Migration Automation: How AI Agents Reduce 18-Month Timelines to Weeks — Enterprise data migrations take 6-18 months because schema mapping, data validation, and downtime coordination are manual. AI agents comp…
- Stop Building Data Connectors: How AI Agents Auto-Generate Integrations — Data teams spend 20-30% of their time maintaining connectors. AI agents that auto-generate and self-heal integrations eliminate this main…
- Why One AI Agent Isn't Enough: Coordinating Agent Swarms Across Your Data Stack — A single AI agent can handle one domain. But data engineering spans 10+ domains — quality, governance, pipelines, schema, streaming, cost…
- Data Contracts for Data Engineers: How AI Agents Enforce Schema Agreements — Data contracts define the agreement between data producers and consumers. AI agents enforce them automatically — detecting violations, no…
- The Data Incident Response Playbook: From Alert to Root Cause in Minutes — Most data teams lack a formal incident response process. This playbook provides severity levels, triage workflows, root cause analysis st…
- 10 Data Engineering Tasks You Should Automate Today — Data engineers spend the majority of their time on repetitive tasks that AI agents can handle. Here are 10 tasks to automate today — from…
Explore Topic Clusters
- Data Governance: The Complete Guide — Policies, access controls, PII, and compliance at scale.
- Data Catalog: The Complete Guide — Discovery, metadata, lineage, and the modern catalog stack.
- Data Lineage: The Complete Guide — Column-level lineage, impact analysis, and observability.
- Data Quality: The Complete Guide — Tests, SLAs, anomaly detection, and data reliability engineering.
- AI Data Engineering: The Complete Guide — LLMs, agents, and autonomous workflows across the data stack.
- MCP for Data: The Complete Guide — Model Context Protocol servers, tools, and agent integration.
- Data Mesh & Data Fabric: The Complete Guide — Federated ownership, domain-oriented architecture, and interop.
- Open-Source Data Stack: The Complete Guide — dbt, Airflow, Iceberg, DuckDB, and the modern OSS toolkit.