dbt Snapshots Explained: SCD Type 2 in Five Lines of YAML
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
dbt snapshots implement Slowly Changing Dimension Type 2 with a single YAML config — they track history on mutable source tables by inserting a new row every time a tracked column changes. They are the fastest way to preserve history when your source system overwrites records in place.
This guide explains how snapshots work, when to use them, the subtle configuration choices that decide whether your history table stays trustworthy, and the disaster-recovery patterns you need because full-refreshing a snapshot wipes every historical row you have ever captured.
What Are dbt Snapshots?
A dbt snapshot is a special model type that runs a SQL query, compares the results to a target table, and inserts new rows with dbt_valid_from and dbt_valid_to timestamps whenever it sees changes. You point it at a mutable source (a production orders table, a CRM customers table) and it maintains a history table that preserves every version of every row.
This is Kimball's Slowly Changing Dimension Type 2, automated. Instead of writing merge logic and tracking versions by hand, you declare the snapshot in YAML and dbt handles the diffing on every run. The whole pattern is typically 10-20 lines of config per table.
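As a concrete sketch of how small the config is, here is a minimal snapshot in dbt 1.9+'s YAML syntax. The source, schema, and column names are illustrative, not from a real project:

```yaml
# snapshots/orders_snapshot.yml -- illustrative names, dbt 1.9+ YAML snapshot syntax
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')  # hypothetical source
    config:
      schema: snapshots        # keep state in its own schema
      unique_key: order_id     # how dbt matches source rows to history rows
      strategy: timestamp      # detect changes via the updated_at column
      updated_at: updated_at
```

Running `dbt snapshot` then builds and maintains `snapshots.orders_snapshot`, adding the `dbt_valid_from` / `dbt_valid_to` columns automatically.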
Snapshot Strategies
| Strategy | How It Detects Change | Best For |
|---|---|---|
| timestamp | Watches an updated_at column | Sources with reliable updated_at |
| check | Hashes a list of columns | Sources without updated_at |
Timestamp Strategy
The timestamp strategy is the recommended and usually most reliable choice. You tell dbt which column tracks updates (updated_at, modified_date) and it treats any row with a newer timestamp as a change. This is efficient at scale because dbt only has to compare timestamps, not full row contents.
The catch is that your source has to update updated_at correctly on every change. If an application forgets to touch it, the snapshot silently misses history. Always verify with a test that updated_at has actually moved before relying on it — a dbt test that compares historical rowcount growth to expected update volume will catch silent drift.
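One way to catch silent drift is a singular dbt test that flags source rows whose content differs from the current snapshot version even though updated_at has not advanced — exactly the change the timestamp strategy would miss. A sketch, assuming the hypothetical source and snapshot names from above:

```sql
-- tests/assert_updated_at_moves.sql -- singular test; table/column names are illustrative
-- Fails (returns rows) when tracked columns changed but updated_at did not.
select s.order_id
from {{ source('jaffle_shop', 'orders') }} as s
join {{ ref('orders_snapshot') }} as snap
  on snap.order_id = s.order_id
 and snap.dbt_valid_to is null                      -- compare against the current version only
where s.updated_at <= snap.updated_at               -- the timestamp did not move...
  and (s.status is distinct from snap.status
       or s.amount is distinct from snap.amount)    -- ...but tracked columns did
```

`is distinct from` handles nulls cleanly on most warehouses; swap in the null-safe comparison your warehouse supports if not.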
Check Strategy
When your source has no updated_at, the check strategy hashes the tracked columns and compares hashes on every run. It is more expensive because dbt has to read all tracked columns, but it works on any source. Specify check_cols=['col1','col2'] to limit which columns trigger a new version — you probably do not want tiny metadata changes to flood the history table.
The check strategy also matters for sources that overwrite updated_at on every row touch (even if no meaningful attributes changed). In that case timestamp would produce duplicate versions; check only inserts when the hash actually differs.
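A check-strategy config looks the same as the timestamp version, with check_cols in place of updated_at. Again, all names are illustrative:

```yaml
# snapshots/customers_snapshot.yml -- illustrative names, dbt 1.9+ YAML snapshot syntax
snapshots:
  - name: customers_snapshot
    relation: source('crm', 'customers')  # hypothetical source without updated_at
    config:
      schema: snapshots
      unique_key: customer_id
      strategy: check
      check_cols: ['status', 'email', 'plan']  # only changes to these create a new version
```

Listing columns explicitly, rather than using `check_cols: all`, is what keeps noisy metadata columns from flooding the history table.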
Invalidate Hard Deletes
By default, snapshots cannot tell the difference between a deleted row and a row that has not changed. Set invalidate_hard_deletes=true to mark rows as deleted when they disappear from the source. This is critical if your source can delete records and you need to know when — for example, a CRM where marking a contact as inactive is actually a DELETE rather than a flag flip.
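Enabling this is one extra line in the config. A sketch reusing the hypothetical CRM snapshot from above:

```yaml
snapshots:
  - name: customers_snapshot
    relation: source('crm', 'customers')  # hypothetical source
    config:
      unique_key: customer_id
      strategy: timestamp
      updated_at: updated_at
      invalidate_hard_deletes: true  # set dbt_valid_to when a row vanishes from the source
      # On dbt 1.9+ the newer spelling is: hard_deletes: invalidate
```

With this set, a disappeared row gets its dbt_valid_to closed out, so downstream models can tell "deleted" apart from "unchanged".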
Common Pitfalls
- Snapshot the raw source, not a transformed model — otherwise you lose the ability to rebuild transformations
- Store snapshots in a dedicated schema — they are state, not derived
- Never full-refresh a snapshot — you will lose all history; use a backup
- Test that dbt_valid_to is null or > dbt_valid_from — catches timestamp bugs
- Run snapshots before incremental models — so downstream sees latest state
- Back up snapshots daily — they are irreplaceable once lost
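The validity-window check above can be written as a singular dbt test. A sketch, using the hypothetical orders_snapshot name from earlier:

```sql
-- tests/assert_snapshot_validity_windows.sql -- singular test; snapshot name is illustrative
-- Returns rows that violate the invariant: dbt_valid_to must be null (current row)
-- or strictly greater than dbt_valid_from.
select *
from {{ ref('orders_snapshot') }}
where dbt_valid_to is not null
  and dbt_valid_to <= dbt_valid_from
```

Any row this test returns points at a timestamp bug — typically a misconfigured updated_at column or clock skew in the source.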
Disaster Recovery
Snapshots are the one table type in your warehouse that cannot be rebuilt from source — once the source overwrites a value, only your snapshot knows the old version existed. Back them up daily to object storage, test the restore, and never let a well-meaning engineer run dbt snapshot --full-refresh on a production schema. Add a pre-hook that checks environment and aborts on prod.
Time-travel capable warehouses (Snowflake, BigQuery, Databricks) give you a second safety net via Fail-safe or Time Travel, but those windows expire after 7-90 days. Long-term history needs explicit backups to cold storage. The cheapest pattern is a weekly dbt run-operation that exports the snapshot to Parquet on S3; the restore path is a simple COPY INTO back into the snapshot schema if disaster strikes.
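The export can be a small run-operation macro. This is a Snowflake-specific sketch — the stage name is hypothetical, and other warehouses need their own unload syntax:

```sql
-- macros/backup_snapshot.sql -- Snowflake-specific sketch; @backup_stage is a hypothetical external stage
{% macro backup_snapshot(snapshot_name) %}
  {% set sql %}
    copy into @backup_stage/{{ snapshot_name }}/{{ run_started_at.strftime('%Y%m%d') }}/
    from {{ ref(snapshot_name) }}
    file_format = (type = parquet)
  {% endset %}
  {% do run_query(sql) %}
{% endmacro %}
```

Scheduled as `dbt run-operation backup_snapshot --args '{snapshot_name: orders_snapshot}'`, this writes a dated Parquet copy to object storage; the restore path is a COPY INTO from the same stage back into the snapshot schema.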
Snapshot Performance at Scale
Snapshot runs get slower as the history table grows because dbt has to identify which source rows have changed. The timestamp strategy scales well because the comparison is cheap. The check strategy can slow dramatically on wide tables with many tracked columns — hashing every column every run is expensive. If your check-based snapshot is taking hours, reduce check_cols to the columns that matter for history and accept that silent changes to other columns will not be tracked.
For very large tables, consider partitioning the snapshot target by a logical key (e.g. tenant_id) so each run only scans a subset. This is warehouse-specific — Snowflake, BigQuery, and Databricks all handle it differently — but the payoff is linear speedup on multi-tenant data. Avoid running multiple snapshots on the same table concurrently; the merge logic is not safe across parallel runs.
Snapshots in the Broader SCD Toolbox
Snapshots cover SCD Type 2 beautifully. For Type 1 (overwrite), use regular dbt models. For Type 3 (prior value), add a column manually. For Type 4 (separate history table), snapshots effectively are that pattern. For Data Vault satellites, see our data vault vs. Kimball comparison.
Mature dbt projects typically have a dedicated snapshots schema with one snapshot per mutable source table, run before the rest of the transformation DAG. Downstream models then read from the snapshot tables instead of the raw source — that way every transformation in the project sees consistent history, not just a point-in-time view that changes between runs.
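In a scheduled job this ordering is usually just the command sequence below — a minimal sketch of a production run, assuming snapshots live in their own schema:

```shell
dbt snapshot   # diff mutable sources, insert new history versions first
dbt run        # downstream models read from the snapshot tables
dbt test       # validate validity windows, uniqueness, and freshness
```

Running `dbt snapshot` first guarantees every model in the same invocation sees the same, already-captured history.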
Agent-Managed Snapshots
Data Workers' migration and catalog agents can detect when a new source table should be snapshotted, generate the YAML automatically, and monitor for silent history gaps. See our guide to autonomous data engineering, or book a demo to watch the flow end to end.
dbt snapshots turn SCD Type 2 from a dreaded pattern into a five-line YAML file. Pick the right strategy, invalidate hard deletes when it matters, never full-refresh, and back them up — history is the whole point and you cannot get it back once it is gone.
Related Resources
- Why Your dbt Semantic Layer Needs an Agent Layer on Top — The dbt semantic layer is the best way to define metrics. But definitions alone don't prevent incidents or optimize queries. An agent lay…
- Claude Code + Snowflake/BigQuery/dbt: Integration Patterns for Data Teams — Practical integration patterns: Snowflake CLI + MCP, BigQuery MCP server, dbt MCP server with Claude Code.
- Data Engineering with dbt: The Modern Workflow — Covers dbt's role in modern data stacks, project structure, best practices, and automation.
- dbt Tests Best Practices: PKs, FKs, Severity, and CI — Best practices for dbt tests at scale: non-negotiables, severity config, CI integration, and organizing tests past 500 models.
- dbt Incremental Models: Strategies, unique_key, and Lookback Windows — Complete guide to dbt incremental models: strategies, unique_key, late-arriving data, cost tuning, and debugging drift.
- Apache Iceberg Explained: The Open Table Format That Won — Deep guide to Apache Iceberg: architecture, catalogs, features, migration from Hive, engine support, and production operations.
- Change Data Capture Explained: How CDC Keeps Warehouses in Sync — Guide to change data capture: log-based vs query-based, initial snapshots, deleted row handling, and landing CDC into lakehouses.
- dbt Alternatives in 2026: When Analytics Engineering Needs More — dbt is the analytics engineering standard. But Fivetran merger pricing, limited real-time support, and growing agent needs are driving te…
- dbt vs Dataform: Which SQL Transform Tool Wins? — Contrasts dbt (ecosystem leader) with Dataform (BigQuery-native), covers migration and the right tool per stack.
- dbt Cloud vs dbt Core: Feature and Pricing Comparison — Feature-by-feature comparison of dbt Core and dbt Cloud: when each wins, the hybrid path, Fusion engine, and the typical migration path.
- Why AI Agents Need MCP Servers for Data Engineering — MCP servers give AI agents structured access to your data tools — Snowflake, BigQuery, dbt, Airflow, and more. Here is why MCP is the int…
- The Complete Guide to Agentic Data Engineering with MCP — Agentic data engineering replaces manual pipeline management with autonomous AI agents. Here is how to implement it with MCP — without lo…