dbt Incremental Models: Strategies, unique_key, and Lookback Windows
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
dbt incremental models rebuild only new or changed rows instead of recomputing the entire table on every run. They are the main tool for keeping large fact tables fresh without paying for a full refresh every hour. Incremental strategies — append, merge, delete+insert, insert_overwrite — control how new data gets applied to the existing table.
This guide walks through when to use incremental models, how to pick a strategy, the gotchas that burn most teams on their first implementation, and the testing patterns that catch incremental drift before it corrupts your marts. Incremental done wrong is the single most common source of silent data bugs in dbt projects.
What Are Incremental Models?
A normal dbt model rebuilds the target table on every run. For a 100-row dim_users this is fine. For a 10-billion-row fact_events table this is catastrophic — you would spend all day and all your warehouse credits on full rebuilds. Incremental models solve this by computing only the new rows since the last run and merging them into the existing table.
The config block {{ config(materialized='incremental') }} activates the pattern. Inside the model, a {% if is_incremental() %} block filters the source to only new data. On a full refresh, dbt builds the whole table; on subsequent runs, it appends or merges only the delta. The pattern is simple; the gotchas are not.
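A minimal sketch of the pattern, assuming a hypothetical `raw_events` source with an `event_time` column:

```sql
-- models/fct_events.sql
{{ config(materialized='incremental') }}

select
    event_id,
    user_id,
    event_time
from {{ source('raw', 'raw_events') }}

{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than the target's high-water mark
  where event_time > (select max(event_time) from {{ this }})
{% endif %}
```

On the first run (or with `dbt run --full-refresh`), the `is_incremental()` block is skipped and the whole table is built; every run after that processes only the delta.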
The Four Incremental Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| append | Insert new rows, no update | Immutable event logs |
| merge | UPSERT on unique_key | Mutable state (orders, users) |
| delete+insert | Delete matching keys, insert new | Warehouses without MERGE |
| insert_overwrite | Replace partitions wholesale | Partitioned lakes (BigQuery, Databricks) |
Picking the Right Strategy
Use append for pure event streams where rows never change — page views, clicks, payments. Use merge when rows can update (order status changes, user profile edits) and your warehouse supports it. Use insert_overwrite for partitioned lake tables where rewriting whole partitions is cheaper than row-level merges.
The choice affects correctness, not just performance. Append on a mutable source silently drops late-arriving updates. Merge on a mis-specified unique_key creates duplicates. Partition overwrites on poorly partitioned tables reprocess too much or too little. Picking wrong is the single biggest incremental footgun.
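The strategy is a single config knob. A sketch for a mutable orders table (model and key names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',   -- or 'append', 'delete+insert', 'insert_overwrite'
    unique_key='order_id'           -- required for merge: the grain dbt upserts on
) }}
```

With `merge`, a re-arriving `order_id` updates the existing row in place; with `append`, the same row would simply be inserted again.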
The unique_key Gotcha
The merge strategy requires a unique_key — the column or columns dbt uses to find existing rows to update. Get this wrong and you will either get duplicates (key too narrow) or unintended overwrites (key too broad). Test your incremental merges by running the model twice in a row with the same data; the row count should stay constant. If it grows, your unique_key is wrong.
A composite unique_key (list of columns) is the right answer for partitioned tables where id alone is not unique but (id, partition_date) is. dbt supports this natively — just pass a list to unique_key in config.
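Passing the list looks like this (column names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    -- composite key: id alone repeats across partitions, but the pair is unique
    unique_key=['id', 'partition_date']
) }}
```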
Late-Arriving Data
The hardest incremental bug is late-arriving data — events that land an hour after their timestamp says they should. A naive incremental filter such as `where event_time > (select max(event_time) from {{ this }})` will skip them: by the time they arrive, the high-water mark has already moved past them. Use a lookback window — subtract a few hours from the max before filtering — to reprocess a small trailing slice of the table while still handling most of it incrementally.
The right lookback depends on how late your source can be. Most streaming sources are late by seconds; SaaS APIs can be late by hours. Pick a lookback that covers your 99th percentile lateness and monitor for rows older than the window as a separate alert.
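A sketch of a three-hour lookback in dbt's own idiom (the `dateadd` syntax here is Snowflake's; adjust for your warehouse):

```sql
{% if is_incremental() %}
  -- reprocess the trailing 3 hours so late-arriving rows are picked up
  where event_time > (
      select dateadd(hour, -3, max(event_time)) from {{ this }}
  )
{% endif %}
```

Pair a lookback with the merge strategy and a correct unique_key: the reprocessed rows then upsert cleanly instead of duplicating.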
Testing and Monitoring
Incremental models need extra tests: row count trends (sudden drops mean broken filters), no duplicates on the unique_key, and freshness against the source. Combine with dbt tests best practices and run full refreshes on a schedule (weekly) to self-heal drift.
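A sketch of the schema tests, assuming a model named `fct_events` with a composite key and the `dbt_utils` package installed:

```yaml
version: 2

models:
  - name: fct_events
    tests:
      # catches a mis-specified unique_key: duplicates mean the merge grain is wrong
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - event_id
            - event_date
    columns:
      - name: event_id
        tests:
          - not_null
```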
Full Refresh Strategy
Even with perfect incremental logic, full refreshes occasionally surface bugs — a row that should have been updated but was not, a late-arriving row that escaped the lookback window. Schedule weekly or monthly full refreshes to self-heal. Most teams run them on weekends when warehouse load is light.
Size matters here. A full refresh on a 10-billion-row table can take hours and cost real money. Partition-level refreshes are a middle ground: rebuild only partitions older than the last week where drift is most likely, not the whole table. dbt's insert_overwrite strategy handles this cleanly on partitioned lake tables.
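On BigQuery, for example, a static-partition `insert_overwrite` config can rebuild just the most recent partitions each run (table and column names are illustrative):

```sql
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

-- only today's and yesterday's partitions are selected, then swapped in wholesale
select event_id, event_date, payload
from {{ source('raw', 'raw_events') }}
where event_date in ({{ partitions_to_replace | join(',') }})
```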
Cost and Performance
Incremental models typically cut warehouse spend by 10-100x versus full refreshes on large tables. The savings are biggest on append-only event tables and smallest on frequently updated state tables where merge overhead adds up. Monitor the per-run cost and full-refresh cost side by side to make sure your incremental strategy is actually saving money and not just shifting cost to future full refreshes.
Warehouse-specific tuning helps too. Snowflake benefits from clustering keys on the unique_key columns; BigQuery benefits from partition pruning matching your lookback filter; Databricks benefits from Z-order on incremental merge targets. Match the physical layout to your incremental strategy.
Watch out for silent cost creep. A model that started at 100MB per run can grow to 10GB per run as the source grows, and the incremental savings stay flat while the absolute cost climbs. Set alerts on per-model credit consumption and investigate any model whose cost grew more than 2x quarter-over-quarter — that is usually the sign that the incremental filter is drifting wider than intended.
Debugging Incremental Bugs
When an incremental model goes wrong, the symptoms are usually subtle: gradual row-count drift, missing late-arriving rows, duplicates after a retry. The fastest debugging approach is to build the model incrementally, then rebuild it with --full-refresh into a separate comparison table and diff the two — any difference is a bug in the incremental logic. Then narrow down by checking the incremental filter, the unique_key, and the lookback window, in that order.
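One way to diff the two builds, assuming a hypothetical full-refresh copy materialized as `fct_events_audit` (the `except` operator is standard SQL; some warehouses call it `minus`):

```sql
-- rows present in one build but not the other; an empty result means
-- the incremental logic agrees with a from-scratch rebuild
(select * from analytics.fct_events
 except
 select * from analytics.fct_events_audit)
union all
(select * from analytics.fct_events_audit
 except
 select * from analytics.fct_events)
```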
Automating With Agents
Data Workers' pipeline agent detects incremental drift, triggers full refreshes when needed, and refactors merge strategies when schemas evolve — see autonomous data engineering for how agents own incremental health. Book a demo to watch it run.
Incremental models are how dbt scales to billions of rows. Pick the strategy that matches your data's mutability, set a lookback window, add row-count tests, and schedule periodic full refreshes — then let agents watch for drift so you do not have to hand-tune merge logic forever.
See Data Workers in action
15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.
Book a Demo
Related Resources
- Why Your dbt Semantic Layer Needs an Agent Layer on Top — The dbt semantic layer is the best way to define metrics. But definitions alone don't prevent incidents or optimize queries. An agent lay…
- Claude Code + Snowflake/BigQuery/dbt: Integration Patterns for Data Teams — Practical integration patterns: Snowflake CLI + MCP, BigQuery MCP server, dbt MCP server with Claude Code.
- Data Engineering with dbt: The Modern Workflow — Covers dbt's role in modern data stacks, project structure, best practices, and automation.
- dbt Tests Best Practices: PKs, FKs, Severity, and CI — Best practices for dbt tests at scale: non-negotiables, severity config, CI integration, and organizing tests past 500 models.
- dbt Snapshots Explained: SCD Type 2 in Five Lines of YAML — Guide to dbt snapshots: timestamp vs check strategy, hard deletes, scaling considerations, and why never full-refresh.
- dbt Alternatives in 2026: When Analytics Engineering Needs More — dbt is the analytics engineering standard. But Fivetran merger pricing, limited real-time support, and growing agent needs are driving te…
- dbt vs Dataform: Which SQL Transform Tool Wins? — Contrasts dbt (ecosystem leader) with Dataform (BigQuery-native), covers migration and the right tool per stack.
- dbt Cloud vs dbt Core: Feature and Pricing Comparison — Feature-by-feature comparison of dbt Core and dbt Cloud: when each wins, the hybrid path, Fusion engine, and the typical migration path.
- Why AI Agents Need MCP Servers for Data Engineering — MCP servers give AI agents structured access to your data tools — Snowflake, BigQuery, dbt, Airflow, and more. Here is why MCP is the int…
- The Complete Guide to Agentic Data Engineering with MCP — Agentic data engineering replaces manual pipeline management with autonomous AI agents. Here is how to implement it with MCP — without lo…
- How AI Agents Cut Snowflake Costs by 40% Without Manual Tuning — Most Snowflake environments waste 30-40% of compute on zombie tables, oversized warehouses, and unoptimized queries. AI agents find and f…
- RBAC for Data Engineering Teams: Why Manual Access Control Doesn't Scale — Manual RBAC breaks down at 50+ data assets. Policy drift, orphaned permissions, and PII exposure become inevitable. AI agents enforce gov…
Explore Topic Clusters
- Data Governance: The Complete Guide — Policies, access controls, PII, and compliance at scale.
- Data Catalog: The Complete Guide — Discovery, metadata, lineage, and the modern catalog stack.
- Data Lineage: The Complete Guide — Column-level lineage, impact analysis, and observability.
- Data Quality: The Complete Guide — Tests, SLAs, anomaly detection, and data reliability engineering.
- AI Data Engineering: The Complete Guide — LLMs, agents, and autonomous workflows across the data stack.
- MCP for Data: The Complete Guide — Model Context Protocol servers, tools, and agent integration.
- Data Mesh & Data Fabric: The Complete Guide — Federated ownership, domain-oriented architecture, and interop.
- Open-Source Data Stack: The Complete Guide — dbt, Airflow, Iceberg, DuckDB, and the modern OSS toolkit.