
Data Engineering with dbt: The Modern Workflow


Written by the Data Workers team: 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


Data engineering with dbt means writing SQL transformations as version-controlled models with tests, documentation, and lineage — run against any modern cloud warehouse. dbt transformed analytics engineering by making SQL models look like software engineering: pull requests, CI, testing, and observability are all first-class.

dbt is the most important data engineering tool of the last decade. This guide walks through what dbt does, how it fits into a modern stack, and the patterns that work in production teams.

What dbt Does

dbt (data build tool) compiles a project of SQL files into a DAG of models, runs them against your warehouse, and runs tests on the results. Every model is a SELECT statement that dbt wraps in CREATE TABLE or CREATE VIEW. Jinja templating adds variables, macros, and reusable logic. The whole project lives in git.
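For concreteness, here is a minimal sketch of a model file. The model and column names (`fct_orders`, `stg_orders`) are hypothetical:

```sql
-- models/marts/fct_orders.sql
-- A model is just a SELECT; dbt wraps it in CREATE TABLE or CREATE VIEW.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_total
from {{ ref('stg_orders') }}  -- ref() resolves to the upstream model and records lineage
```

Because every model declares its upstream dependencies through `ref()`, dbt can infer the DAG and run models in the correct order.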

Crucially, dbt pushes transformation logic into the warehouse rather than pulling data into a separate processing engine. This ELT-over-ETL approach takes full advantage of cloud warehouse scale and avoids the operational complexity of managing Spark clusters or Python environments for every transformation. The simplicity is why dbt spread so quickly across the analytics engineering community.

| Feature   | What It Provides                     |
|-----------|--------------------------------------|
| Models    | SQL + CREATE TABLE/VIEW wrappers     |
| Tests     | Schema and data quality assertions   |
| Docs      | Auto-generated docs site with lineage|
| Macros    | Reusable SQL snippets                |
| Packages  | Community libraries like dbt_utils   |
| Snapshots | SCD Type 2 automation                |
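Snapshots deserve a closer look, since "SCD Type 2 automation" is terse: dbt captures row-level history by comparing each run against the previous state. A hedged sketch, with table and column names assumed:

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',     -- detect changes via an updated_at column
        updated_at='updated_at'
    )
}}
select * from {{ source('app', 'customers') }}
{% endsnapshot %}
```

Each `dbt snapshot` run inserts new versions of changed rows and closes out old ones with validity timestamps, giving you Type 2 history without hand-written merge logic.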

Why dbt Took Over

Before dbt, SQL transformations lived in stored procedures, Airflow operators, and ad-hoc scripts. They were hard to test, review, or reason about. dbt turned transformations into version-controlled code with tests and documentation — suddenly analytics engineers could use the same workflow as software engineers. That shift made the modern analytics engineering discipline possible.

dbt also benefitted from the rise of cloud warehouses. Snowflake, BigQuery, and Redshift made it trivial to spin up isolated environments, pay per query, and handle scale automatically. dbt took advantage of all three — isolated schemas for dev, pay-per-use CI runs, and automatic scaling for production models. It rode the cloud-warehouse wave and became the default SQL transformation tool.

The dbt Project Structure

  • Staging models — light cleanup of raw source data
  • Intermediate models — joins and business logic
  • Mart models — curated facts and dimensions
  • Tests — schema and data quality per model
  • Docs — descriptions for every model and column
  • Sources — freshness and documentation for raw tables

The staging, intermediate, and mart layering is the single most important dbt convention. Teams that skip it end up with sprawling models that reference raw tables directly, making refactors dangerous and onboarding slow. Even small projects benefit from the layering because it forces a clear separation between raw data, internal transformations, and consumer-facing marts.
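A staging model is typically the only layer that references raw sources via `source()`. A sketch of the convention, with names assumed:

```sql
-- models/staging/stg_orders.sql
-- Staging: rename, cast, light cleanup. No joins, no business logic.
select
    id as order_id,
    customer as customer_id,
    cast(total as numeric) as order_total,
    created_at
from {{ source('app', 'orders') }}
```

Downstream intermediate and mart models then reference `stg_orders` via `ref()`, never the raw table, so a raw-schema change only requires fixing one file.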

dbt Core vs dbt Cloud

dbt Core is open source and free — run it from the CLI or any scheduler. dbt Cloud is the managed SaaS from dbt Labs: IDE, scheduling, CI, lineage UI, and semantic layer. Teams that want zero ops go with Cloud; teams that already have scheduling infrastructure run Core alongside Airflow or GitHub Actions.

The pricing gap matters as projects grow. Large dbt Cloud deployments can cost six figures per year, which pushes mature teams toward Core on Airflow or Dagster for new projects. Smaller teams benefit from dbt Cloud's IDE and one-click deploys until they outgrow them. Neither choice is permanent — migrating between Core and Cloud is straightforward compared to rewriting all your models.
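Running Core from an existing scheduler can be as simple as a CI workflow. A minimal GitHub Actions sketch — the adapter, target name, and secret names are assumptions you would adapt to your warehouse:

```yaml
# .github/workflows/dbt_ci.yml
name: dbt CI
on: pull_request
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake   # swap in the adapter for your warehouse
      - run: dbt deps                    # install packages like dbt_utils
      - run: dbt build --target ci       # run models and tests in an isolated CI schema
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
```

`dbt build` runs models, tests, snapshots, and seeds in DAG order, so a single command gives you the full pay-per-use CI run described above.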

dbt Best Practices

Use staging models as the boundary between raw and internal logic. Name models with layer prefixes (stg_, int_, fct_, dim_). Write tests for every mart table. Use ref(), never hardcoded table names. Review every PR. Use environment-specific profiles. These patterns are boring, but they are what separate reliable dbt projects from chaotic ones.
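Tests live in YAML alongside the models. A sketch of a mart's schema file — model and column names are carried over from the hypothetical example above:

```yaml
# models/marts/schema.yml
version: 2
models:
  - name: fct_orders
    description: One row per order.
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:          # every order must point at a known customer
              to: ref('dim_customers')
              field: customer_id
```

`dbt test` (or `dbt build`) compiles each assertion into a query and fails the run if any rows violate it.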

Implementation Roadmap

A pragmatic dbt adoption starts with a single analytics engineer porting one mart model from an existing stored procedure or notebook. Prove the test and CI workflow on that one model, then expand. Organizations that try to rewrite everything at once stall for months. Incremental adoption, with the old and new paths running in parallel, is the proven path.

Common Pitfalls

Common dbt pitfalls are too many incremental models (hard to reason about and backfill), macro abuse (code becomes unreadable), ignoring test failures in CI, and letting the DAG grow without domain boundaries. Establish domain folders early, keep macros simple, and treat red tests the same way you treat red builds — not optional, not deferred.
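Incremental models are worth their complexity only when a table is too large to rebuild on every run. A hedged sketch of the standard pattern, with names assumed:

```sql
-- models/marts/fct_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_ts
from {{ ref('stg_events') }}
{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already loaded.
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

The `is_incremental()` branch is why these models are hard to reason about: the model behaves differently on full refreshes versus incremental runs, and backfills require `dbt run --full-refresh`. Reach for this pattern sparingly.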

Real-World Examples

Companies like JetBlue, HubSpot, and GitLab have publicly shared dbt implementations with thousands of models and hundreds of contributors. The common thread in successful deployments is strong conventions, disciplined PR review, and automated test execution in CI. Size is not the challenge — drift from agreed conventions is.

Smaller teams get outsized value from dbt because one analytics engineer can own a production data stack without a dedicated platform team. A 200-model dbt project with tests and CI is comfortably maintained by two analytics engineers, while the same scope in stored procedures would need an ops team. This multiplier effect is why dbt adoption spreads fastest in fast-moving startups.

ROI Considerations

dbt ROI compounds over time as the test and documentation investment pays back. Teams commonly report substantially fewer data incidents than comparable pre-dbt workflows, faster onboarding for new analytics engineers, and better cross-team collaboration because everything lives in git. The license cost of dbt Cloud is usually small relative to the engineering time saved.

For related reading see dbt vs dataform, how to build a data pipeline, and how to design a data warehouse.

Automating dbt Operations

Data Workers pipeline agents run dbt projects autonomously — monitoring runs, diagnosing test failures, writing fix PRs, and enforcing contracts. The result is dbt operations that scale past a single analytics engineer per project. Book a demo to see autonomous dbt operations.

dbt is the standard for SQL transformations in modern warehouses. Use it for staging, intermediate, and mart models, write tests for every table, and treat your dbt project like software. The teams that invest in dbt discipline early ship faster and trust their data more.
