What Is Data Modeling? A Modern Guide
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
Data modeling is the process of designing how data is structured in a database or warehouse — defining tables, relationships, keys, and types so the schema matches the business domain and supports the queries people need. Good modeling makes analytics fast and intuitive. Bad modeling makes every query a struggle.
Data modeling sounds academic but it is the single highest-leverage activity in building a trustworthy warehouse. This guide walks through the main modeling approaches, when to use each, and why dimensional modeling (Kimball) dominates modern analytics.
Modeling is often underrated because its impact is indirect. A good model does not show up in any dashboard as a specific number; it shows up as consistently correct numbers across every dashboard. Bad modeling shows up as confusion, duplicate work, and "why does this number not match what finance has" conversations. Teams that invest in modeling spend less time debugging and more time shipping insights. It is the kind of infrastructure that is invisible when it works and painful when it does not.
The Three Main Modeling Styles
Three schools dominate data modeling: entity-relationship (Chen, for OLTP databases), dimensional / Kimball (facts and dimensions for analytics), and Data Vault (hubs, links, and satellites for enterprise integration). Most modern analytics projects use dimensional modeling; OLTP systems use ER; very large regulated enterprises sometimes use Data Vault.
| Style | Best For | Example |
|---|---|---|
| Entity-Relationship | OLTP databases | customers + orders + products normalized |
| Dimensional (Kimball) | Analytics warehouses | fct_orders + dim_customers star schema |
| Data Vault | Enterprise integration | hub_customer + link_order + satellite_customer_details |
| Denormalized | Analytical engines, BI | one big table with all columns |
| Document | Semi-structured data | JSON document per customer |
Dimensional Modeling (Kimball)
Kimball-style dimensional modeling is the most popular approach for analytics warehouses. It splits data into fact tables (events with numeric measures) and dimension tables (descriptive context). A star schema has facts in the middle and dimensions around them; a snowflake schema normalizes dimensions further. Star schemas are faster and simpler to query.
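A star schema is easiest to see in code. The sketch below builds a tiny fact and dimension in an in-memory SQLite database and runs the canonical analytics query: aggregate the fact, slice by the dimension. All table and column names here are illustrative, not from any specific warehouse.

```python
import sqlite3

# Minimal star-schema sketch: one fact table of order events joined to one
# descriptive dimension table via a surrogate key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customers (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_name TEXT,
        segment TEXT                        -- descriptive context
    );
    CREATE TABLE fct_orders (
        order_id TEXT,                      -- degenerate dimension
        customer_key INTEGER REFERENCES dim_customers (customer_key),
        order_total REAL                    -- numeric measure
    );
""")
con.executemany("INSERT INTO dim_customers VALUES (?, ?, ?)",
                [(1, "Acme", "enterprise"), (2, "Globex", "smb")])
con.executemany("INSERT INTO fct_orders VALUES (?, ?, ?)",
                [("o-1", 1, 120.0), ("o-2", 1, 80.0), ("o-3", 2, 50.0)])

# The typical analytics query: aggregate the measure, group by an attribute.
rows = con.execute("""
    SELECT d.segment, SUM(f.order_total)
    FROM fct_orders f
    JOIN dim_customers d USING (customer_key)
    GROUP BY d.segment
    ORDER BY d.segment
""").fetchall()
print(rows)  # [('enterprise', 200.0), ('smb', 50.0)]
```

Because every fact row points at exactly one dimension row, any descriptive attribute can become a GROUP BY column without changing the query shape.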
Ralph Kimball's books from the 1990s and 2000s remain the definitive reference for the pattern, but the modern implementation has evolved. dbt's community has effectively codified Kimball practices into a standard project structure with staging, intermediate, and mart layers. Most modern dbt projects are dimensional models in disguise — and that is a good thing, because the pattern has stood up to thirty years of real-world use across every industry.
Key Concepts
- Grain — the atomic level of a fact table (one row per order, one row per line)
- Surrogate key — warehouse-generated key for stable joins
- Slowly changing dimensions — Type 1 (overwrite) vs Type 2 (version)
- Conformed dimensions — dim_date used across all facts
- Degenerate dimensions — IDs stored directly in the fact table
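The Type 1 vs Type 2 distinction above is worth making concrete. This sketch shows the Type 2 pattern in plain Python: when a tracked attribute changes, close the current dimension row and open a new version instead of overwriting. The dict layout and helper function are illustrative assumptions, not a dbt or warehouse API.

```python
from datetime import date

def apply_scd2(history, natural_key, new_attrs, as_of):
    """SCD Type 2 sketch: history is the list of versioned rows for one
    dimension; a row with valid_to=None is the current version."""
    current = [r for r in history
               if r["natural_key"] == natural_key and r["valid_to"] is None]
    if current and current[0]["attrs"] == new_attrs:
        return history                      # no change: keep current version
    if current:
        current[0]["valid_to"] = as_of      # Type 2: close the old version
    history.append({"natural_key": natural_key, "attrs": new_attrs,
                    "valid_from": as_of, "valid_to": None})
    return history

dim_customer = []
apply_scd2(dim_customer, "cust-1", {"plan": "basic"}, date(2026, 1, 1))
apply_scd2(dim_customer, "cust-1", {"plan": "pro"}, date(2026, 3, 1))

# Two versions survive: the closed "basic" row and the open "pro" row.
print([(r["attrs"]["plan"], r["valid_to"]) for r in dim_customer])
```

A Type 1 dimension would simply overwrite `attrs` in place, losing the history; Type 2 keeps both rows so facts can join to the version that was current at event time.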
Modern Dimensional Modeling with dbt
dbt made dimensional modeling the default pattern in cloud analytics. A typical dbt project has staging models, intermediate models, and mart models — the marts are facts and dimensions. dbt snapshots handle SCD Type 2. Tests enforce grain and referential integrity. The whole workflow lives in git with code review.
The dbt convention of naming models by layer (stg_, int_, fct_, dim_) has become a near-universal standard in the community. It is simple but powerful — a new engineer joining a dbt project immediately understands the pipeline topology. Intermediate layers handle complex joins and business logic; the mart layer exposes clean facts and dimensions to consumers. This structure maps directly onto Kimball's architecture while fitting comfortably into modern git-based workflows.
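The layered flow can be sketched as chained views, mirroring the stg_/fct_ naming convention: staging renames and casts the raw source, and the mart exposes the clean fact. The raw table and transformations are illustrative, not a real dbt project.

```python
import sqlite3

# Layered modeling sketch: raw -> stg_ (rename, cast, standardize) -> fct_.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (id TEXT, amount_cents INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES ('o-1', 12000, 'COMPLETE'),
                                  ('o-2', 5000, 'CANCELLED');

    -- staging layer: one view per source, cleanup only
    CREATE VIEW stg_orders AS
        SELECT id AS order_id,
               amount_cents / 100.0 AS order_total,
               LOWER(status) AS status
        FROM raw_orders;

    -- mart layer: the fact table consumers actually query
    CREATE VIEW fct_orders AS
        SELECT order_id, order_total
        FROM stg_orders
        WHERE status = 'complete';
""")
fct = con.execute("SELECT * FROM fct_orders").fetchall()
print(fct)  # [('o-1', 120.0)]
```

In dbt each view would be its own SQL file, but the topology is the same: consumers only touch fct_, and the messy source details stay confined to the staging layer.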
Common Misconceptions
Data modeling is not the same as database design. Database design is about OLTP schemas that support application writes; data modeling is about analytics schemas that support reads and aggregations. They look similar but have different priorities. Modeling also is not optional in cloud warehouses — the assumption that "we will just query the raw tables" breaks the moment you have three analysts and two dashboards, which happens within weeks of any serious data team standing up. And modeling does not slow you down; quite the opposite: a good model makes every subsequent query faster to write.
For deeper technique see how to do data modeling and how to design a data warehouse.
Common Modeling Mistakes
The worst mistake is mixing grains in a single fact table — some rows are orders, some are line items, some are shipments. Second worst is skipping the grain decision entirely and hoping for the best. Third worst is over-normalizing dimensions because it "feels clean" — star schemas are simpler and faster.
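The grain mistake is cheap to catch mechanically. This is the kind of check dbt expresses as a `unique` test: at order grain, the fact table must have exactly one row per order_id, so any duplicate means the grain is broken. Names are illustrative.

```python
import sqlite3

# Grain check sketch: assert one row per order_id in an order-grain fact.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_orders (order_id TEXT, order_total REAL)")
con.executemany("INSERT INTO fct_orders VALUES (?, ?)",
                [("o-1", 120.0), ("o-2", 50.0), ("o-2", 50.0)])  # duplicate!

dupes = con.execute("""
    SELECT order_id, COUNT(*) AS n
    FROM fct_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # a non-empty result means the declared grain is violated
```

Running this check on every build turns "hoping for the best" into a failing test the moment a join fans out or a mixed-grain row sneaks in.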
Data Modeling and AI
Well-modeled warehouses make AI assistants dramatically more accurate. A text-to-SQL model that sees a clean star schema with documented columns writes correct SQL on the first try. The same model facing an unmodeled swamp hallucinates joins. Investment in modeling pays back in AI productivity too.
Data Workers catalog agents surface your data model to AI clients as MCP tools, exposing schema, descriptions, and lineage so AI writes accurate SQL. Book a demo to see AI-native modeling.
Real-World Examples
A SaaS company models subscriptions with a fct_subscription_events fact table (one row per change: created, upgraded, downgraded, cancelled) plus dim_customer, dim_plan, and dim_date. MRR is a derived measure computed from the fact table. An ecommerce retailer models orders with fct_orders at order grain and fct_order_lines at line item grain, with dim_product, dim_customer, dim_store, and dim_date. A media company models viewership with fct_sessions at user-session grain and dim_content, dim_device, and dim_user. Three very different businesses, one consistent pattern.
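The "MRR as a derived measure" idea from the SaaS example can be sketched in a few lines. If each row of the event fact carries a signed MRR delta, total MRR is just a SUM over the table, with no snapshot table needed. The event names and fields are illustrative assumptions.

```python
# Sketch of fct_subscription_events at "one row per change" grain.
events = [  # (customer, event_type, mrr_delta)
    ("c1", "created",   100),
    ("c2", "created",    50),
    ("c1", "upgraded",   50),   # upgrade adds to the running total
    ("c2", "cancelled", -50),   # cancellation delta zeroes the customer out
]

# MRR is derived, never stored: sum the signed deltas over the fact table.
mrr = sum(delta for _, _, delta in events)
print(mrr)  # 150
```

Filtering the same sum by dim_plan or dim_date attributes gives MRR by plan or MRR over time from one fact table, which is exactly the payoff of picking the grain carefully.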
When You Need It
You need formal data modeling the moment more than one person writes analytics queries, or the same metric appears in multiple dashboards. Informal modeling — just querying raw tables and cleaning up in BI tools — works for a team of one or two. It breaks down the moment a third analyst joins and starts writing slightly different queries against the same data. That is when centralized dimensional models become essential.
Data modeling is the process of designing schemas that match the business domain and serve the queries people need. Use dimensional modeling for analytics, keep grains honest, handle SCDs carefully, and document everything. Good models make analytics fast for humans and accurate for AI.
Related Resources
- How to Do Data Modeling: Kimball for the Modern Stack — Walks through Kimball-style dimensional modeling adapted for modern cloud warehouses and dbt.
- Data Vault Modeling Guide: Hubs, Links, Satellites — Deep guide to Data Vault modeling: hub/link/satellite patterns, Raw vs Business Vault, hash keys, PIT tables, and rollout strategy.
- What is Data Observability? The Data Engineer's Complete Guide — Data observability provides visibility into data health across your stack. This guide covers the five pillars, tool landscape, and how AI…
- Meta Data Meaning: Definition, Examples, and Why It Matters — Plain-language definition of meta data with examples and use cases for analysts, engineers, auditors, and AI agents.
- What Is Data Governance With Example: A Practical Guide — Real-world data governance examples from healthcare PHI, banking BCBS 239, and ecommerce GDPR with shared design principles.
- What Is Data Modernization? A 2026 Strategy Guide — Strategy guide covering the four phases of data modernization, common pitfalls, and how to make data AI-ready in 2026.
- What Is a Data Domain? Definition and Examples for Data Mesh — Guide to identifying data domains, using them in data mesh, and applying domain ownership in centralized stacks.
- What Is Data Transparency? Definition and Best Practices — Guide to data transparency including the five characteristics of transparent systems and how AI-native catalogs make transparency automatic.
- What Is Spatial Data? Definition, Types, and Examples — Spatial data primer covering vector vs raster types, common formats, spatial queries in modern warehouses, and quality issues.
- What Is Stale Data? Definition, Detection, and Prevention — Guide to identifying, detecting, and preventing stale data in pipelines with SLA contracts and active monitoring strategies.
- What Is Data Enablement? Definition and Strategy Guide — Strategy guide for data enablement programs covering access, literacy, trust, and tooling pillars.
- What Is a Data Pipeline? Complete 2026 Guide — Defines data pipelines and walks through the three stages, batch vs streaming, and modern tooling.
Explore Topic Clusters
- Data Governance: The Complete Guide — Policies, access controls, PII, and compliance at scale.
- Data Catalog: The Complete Guide — Discovery, metadata, lineage, and the modern catalog stack.
- Data Lineage: The Complete Guide — Column-level lineage, impact analysis, and observability.
- Data Quality: The Complete Guide — Tests, SLAs, anomaly detection, and data reliability engineering.
- AI Data Engineering: The Complete Guide — LLMs, agents, and autonomous workflows across the data stack.
- MCP for Data: The Complete Guide — Model Context Protocol servers, tools, and agent integration.
- Data Mesh & Data Fabric: The Complete Guide — Federated ownership, domain-oriented architecture, and interop.
- Open-Source Data Stack: The Complete Guide — dbt, Airflow, Iceberg, DuckDB, and the modern OSS toolkit.