
What Is Data Modeling? A Modern Guide


Written by 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


Data modeling is the process of designing how data is structured in a database or warehouse — defining tables, relationships, keys, and types so the schema matches the business domain and supports the queries people need. Good modeling makes analytics fast and intuitive. Bad modeling makes every query a struggle.

Data modeling sounds academic but it is the single highest-leverage activity in building a trustworthy warehouse. This guide walks through the main modeling approaches, when to use each, and why dimensional modeling (Kimball) dominates modern analytics.

Modeling is often underrated because its impact is indirect. A good model does not show up in any dashboard as a specific number; it shows up as consistently correct numbers across every dashboard. Bad modeling shows up as confusion, duplicate work, and "why does this number not match what finance has" conversations. Teams that invest in modeling spend less time debugging and more time shipping insights. It is the kind of infrastructure that is invisible when it works and painful when it does not.

The Three Main Modeling Styles

Three schools dominate data modeling: entity-relationship (Chen, for OLTP databases), dimensional / Kimball (facts and dimensions for analytics), and Data Vault (hubs, links, and satellites for enterprise integration). Most modern analytics projects use dimensional modeling; OLTP systems use ER; very large regulated enterprises sometimes use Data Vault.

| Style | Best For | Example |
| --- | --- | --- |
| Entity-Relationship | OLTP databases | customers + orders + products normalized |
| Dimensional (Kimball) | Analytics warehouses | fct_orders + dim_customers star schema |
| Data Vault | Enterprise integration | hub_customer + link_order + satellite_customer_details |
| Denormalized | Analytical engines, BI | one big table with all columns |
| Document | Semi-structured data | JSON document per customer |

Dimensional Modeling (Kimball)

Kimball-style dimensional modeling is the most popular approach for analytics warehouses. It splits data into fact tables (events with numeric measures) and dimension tables (descriptive context). A star schema has facts in the middle and dimensions around them; a snowflake schema normalizes dimensions further. Star schemas are faster and simpler to query.
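The fact-plus-dimension split can be sketched in a few lines. This is a minimal, illustrative star schema in an in-memory SQLite database; the table and column names (fct_orders, dim_customers, customer_key) are hypothetical examples, not a prescribed layout:

```python
import sqlite3

# In-memory warehouse with a minimal star schema (illustrative names).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customers (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE fct_orders (
    order_id TEXT,            -- degenerate dimension: the ID lives on the fact
    customer_key INTEGER REFERENCES dim_customers(customer_key),
    order_total REAL          -- numeric measure
);
INSERT INTO dim_customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
INSERT INTO fct_orders VALUES ('o-1', 1, 100.0), ('o-2', 1, 50.0), ('o-3', 2, 75.0);
""")

# The typical star-schema query shape: one join from fact to dimension,
# then aggregate a measure by a dimension attribute.
rows = con.execute("""
    SELECT d.region, SUM(f.order_total) AS revenue
    FROM fct_orders f
    JOIN dim_customers d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('AMER', 75.0), ('EMEA', 150.0)]
```

Because every dimension is one join away from the fact, queries stay flat and predictable; a snowflake schema would add further joins from dim_customers out to normalized sub-tables.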

Ralph Kimball's books from the 1990s and 2000s remain the definitive reference for the pattern, but the modern implementation has evolved. dbt's community has effectively codified Kimball practices into a standard project structure with staging, intermediate, and mart layers. Most modern dbt projects are dimensional models in disguise — and that is a good thing, because the pattern has stood up to thirty years of real-world use across every industry.

Key Concepts

  • Grain — the atomic level of a fact table (e.g., one row per order, or one row per order line)
  • Surrogate key — warehouse-generated key for stable joins
  • Slowly changing dimensions — Type 1 (overwrite) vs Type 2 (version)
  • Conformed dimensions — dim_date used across all facts
  • Degenerate dimensions — IDs stored directly in the fact table
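The Type 1 vs Type 2 distinction above is easiest to see in code. This sketch applies a Type 2 change: instead of overwriting the changed attribute (Type 1), it closes the current row and appends a new version. The row shape and field names are illustrative, not a standard:

```python
from datetime import date

def scd2_update(dim_rows, natural_key, new_attrs, today):
    """Apply a Type 2 slowly changing dimension update in place."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["valid_to"] is None:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows          # nothing changed: no new version
            row["valid_to"] = today      # close out the current version
    dim_rows.append({"natural_key": natural_key, **new_attrs,
                     "valid_from": today, "valid_to": None})
    return dim_rows

# Customer c1 upgrades from the free tier to pro: history is preserved.
dim_customer = [{"natural_key": "c1", "tier": "free",
                 "valid_from": date(2024, 1, 1), "valid_to": None}]
scd2_update(dim_customer, "c1", {"tier": "pro"}, date(2024, 6, 1))
current = [r for r in dim_customer if r["valid_to"] is None]
print(len(dim_customer), current[0]["tier"])  # 2 pro
```

After the update the dimension holds two rows for the customer, and joins on valid_from/valid_to recover what the customer looked like at any point in time.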

Modern Dimensional Modeling with dbt

dbt made dimensional modeling the default pattern in cloud analytics. A typical dbt project has staging models, intermediate models, and mart models — the marts are facts and dimensions. dbt snapshots handle SCD Type 2. Tests enforce grain and referential integrity. The whole workflow lives in git with code review.

The dbt convention of naming models by layer (stg_, int_, fct_, dim_) has become a near-universal standard in the community. It is simple but powerful — a new engineer joining a dbt project immediately understands the pipeline topology. Intermediate layers handle complex joins and business logic; the mart layer exposes clean facts and dimensions to consumers. This structure maps directly onto Kimball's architecture while fitting comfortably into modern git-based workflows.
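The referential-integrity tests mentioned above reduce to a simple check: every foreign key on the fact must resolve to a row in its dimension. This is a miniature, hypothetical version of what a dbt relationships test asserts; all names and rows are illustrative:

```python
def orphaned_keys(fact_rows, dim_rows, fk, pk):
    """Return fact foreign-key values with no matching dimension row."""
    dim_keys = {r[pk] for r in dim_rows}
    return [r[fk] for r in fact_rows if r[fk] not in dim_keys]

dim_customers = [{"customer_key": 1}, {"customer_key": 2}]
fct_orders = [{"order_id": "o-1", "customer_key": 1},
              {"order_id": "o-2", "customer_key": 3}]  # 3 has no dimension row

print(orphaned_keys(fct_orders, dim_customers,
                    "customer_key", "customer_key"))  # [3]
```

An empty result means the star schema's joins are safe; a non-empty one flags rows that would silently drop out of inner joins in downstream marts.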

Common Misconceptions

Data modeling is not the same as database design. Database design produces OLTP schemas that support application writes; data modeling produces analytics schemas that support reads and aggregations. They look similar but have different priorities. Nor is modeling optional in cloud warehouses: the assumption that "we will just query the raw tables" breaks the moment you have three analysts and two dashboards, a point any serious data team reaches within weeks. And modeling does not slow you down; the opposite is true, because a good model makes every subsequent query faster to write.

For deeper techniques, see how to do data modeling and how to design a data warehouse.

Common Modeling Mistakes

The worst mistake is mixing grains in a single fact table — some rows are orders, some are line items, some are shipments. Second worst is skipping the grain decision entirely and hoping for the best. Third worst is over-normalizing dimensions because it "feels clean" — star schemas are simpler and faster.
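A mixed or broken grain is easy to catch early with a uniqueness check over the columns that are supposed to define it. This sketch uses illustrative rows; in dbt the same check is a unique test on the grain columns:

```python
from collections import Counter

def duplicated_grains(rows, grain_cols):
    """Return grain values that appear more than once; an empty list
    means the declared grain holds."""
    counts = Counter(tuple(r[c] for c in grain_cols) for r in rows)
    return [grain for grain, n in counts.items() if n > 1]

# fct_orders is declared to be one row per order_id, but a line-level
# row leaked in, duplicating the grain.
fct_orders = [
    {"order_id": "o-1", "amount": 100.0},
    {"order_id": "o-2", "amount": 40.0},
    {"order_id": "o-2", "amount": 60.0},  # line item at order grain
]
print(duplicated_grains(fct_orders, ["order_id"]))  # [('o-2',)]
```

Running this on every fact table after each build turns a silent double-counting bug into a loud failure.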

Data Modeling and AI

Well-modeled warehouses make AI assistants dramatically more accurate. A text-to-SQL model that sees a clean star schema with documented columns is far more likely to write correct SQL on the first try. The same model facing an unmodeled swamp hallucinates joins. Investment in modeling pays back in AI productivity too.

Data Workers catalog agents surface your data model to AI clients as MCP tools, exposing schema, descriptions, and lineage so AI writes accurate SQL. Book a demo to see AI-native modeling.

Real-World Examples

A SaaS company models subscriptions with a fct_subscription_events fact table (one row per change: created, upgraded, downgraded, cancelled) plus dim_customer, dim_plan, and dim_date. MRR is a derived measure computed from the fact table. An ecommerce retailer models orders with fct_orders at order grain and fct_order_lines at line item grain, with dim_product, dim_customer, dim_store, and dim_date. A media company models viewership with fct_sessions at user-session grain and dim_content, dim_device, and dim_user. Three very different businesses, one consistent pattern.
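The SaaS case above treats MRR as a derived measure over the event fact table rather than a stored column. One simple sketch of that derivation, with hypothetical event rows (the latest event per subscription determines its current monthly revenue):

```python
# Illustrative rows shaped like a fct_subscription_events table.
events = [
    {"subscription_id": "s1", "event": "created",   "mrr": 50,  "at": 1},
    {"subscription_id": "s1", "event": "upgraded",  "mrr": 120, "at": 2},
    {"subscription_id": "s2", "event": "created",   "mrr": 80,  "at": 1},
    {"subscription_id": "s2", "event": "cancelled", "mrr": 0,   "at": 3},
]

# Replay events in order; the last change wins per subscription.
latest = {}
for e in sorted(events, key=lambda e: e["at"]):
    latest[e["subscription_id"]] = e

mrr = sum(e["mrr"] for e in latest.values())
print(mrr)  # 120
```

Keeping the fact table at event grain and deriving MRR this way means historical MRR at any date falls out of the same table with a filter on the event timestamp.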

When You Need It

You need formal data modeling the moment more than one person writes analytics queries, or the same metric appears in multiple dashboards. Informal modeling — just querying raw tables and cleaning up in BI tools — works for a team of one or two. It breaks down the moment a third analyst joins and starts writing slightly different queries against the same data. That is when centralized dimensional models become essential.

Data modeling is the process of designing schemas that match the business domain and serve the queries people need. Use dimensional modeling for analytics, keep grains honest, handle SCDs carefully, and document everything. Good models make analytics fast for humans and accurate for AI.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
