Guide · 4 min read

How to Do Data Modeling: Kimball for the Modern Stack

Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.

To do data modeling: understand the business process, identify facts and dimensions, design a star or snowflake schema, pick grain and keys carefully, and document everything. The goal is a model that analysts can query in 30 seconds and that survives schema changes without breaking dashboards.

Data modeling is the craft that separates a warehouse analysts trust from one nobody queries twice. This guide walks through Kimball-style dimensional modeling adapted for modern cloud warehouses and dbt.

The best investment a data team can make is modeling discipline. Automation, observability, semantic layers, and AI clients all sit on top of the underlying data model — a weak model means every upper layer compounds the weakness. A strong model makes every tool work better. Treat data modeling as software design: iterate, review, test, and refactor. Models that last five years are the ones that were designed with the same care you would give a public API.

Step 1: Understand the Business Process

Every model starts with a business process: an order being placed, a subscription renewed, a support ticket created. Talk to the team that owns the process. Understand the events, actors, and frequency. Everything downstream flows from this. A model built without business context is a guess.

Write down the process in plain English before you touch SQL. What event are we tracking? Who cares about it? What decisions does it drive?

Step 2: Identify Facts and Dimensions

Facts are events with numeric measures — order_amount, session_duration, click_count. Dimensions describe the context — customer, product, date, campaign. A fact table has many foreign keys to dimensions and one or more measures. Dimensions are relatively small and slowly changing.

The distinction sounds obvious but trips up many analysts the first time. A useful heuristic: if the column answers a "how much" or "how many" question, it is a measure. If it answers a "who, what, when, where" question, it is a dimension. Revenue is a measure. The customer who paid it is a dimension. The date is a dimension. The product they bought is a dimension. Measures go in fact tables; dimensions go in dimension tables. Get this separation right and the rest of modeling becomes mechanical.

Entity            Type                      Example columns
fct_orders        Fact                      order_id, customer_id, revenue
dim_customers     Dimension                 customer_id, name, segment, country
dim_products      Dimension                 product_id, sku, category
dim_date          Dimension                 date_key, day, month, quarter
fct_order_lines   Fact (grain: line item)   line_id, order_id, product_id
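As a sketch, the fact/dimension separation might look like the following DDL. Table and column names here are illustrative, not from any real schema:

```sql
-- Dimension: one row per customer, descriptive attributes only
create table dim_customers (
    customer_key   integer primary key,   -- surrogate key
    customer_id    varchar,               -- natural key from the source system
    name           varchar,
    segment        varchar,
    country        varchar
);

-- Fact: one row per order, foreign keys to dimensions plus measures
create table fct_orders (
    order_id       varchar primary key,
    customer_key   integer references dim_customers (customer_key),
    date_key       integer,               -- references dim_date
    revenue        numeric(12, 2),        -- measure: answers "how much"
    item_count     integer                -- measure: answers "how many"
);
```

Note that the fact table carries no descriptive text at all — everything a human would filter or group by lives in the dimensions.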

Step 3: Choose the Grain

Grain is the atomic level of a fact table: one row per order, one row per order line item, one row per page view. Choose the most granular grain your consumers need — you can always aggregate up, but you cannot disaggregate. Mixing grains in a single fact table is the number one modeling mistake.

  • One row per transaction — most common grain
  • One row per line item — for order analysis
  • One row per day per user — for engagement
  • One row per event — for clickstream
  • Snapshot grain — for slowly changing state
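A grain declaration is only useful if it is enforced. A minimal enforcement check — assuming a one-row-per-order grain on a hypothetical fct_orders table — is a query that returns zero rows when the grain holds (dbt's built-in unique test compiles to essentially this):

```sql
-- Returns rows only when the declared grain (one row per order_id) is violated
select
    order_id,
    count(*) as row_count
from fct_orders
group by order_id
having count(*) > 1;
```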

Step 4: Handle Slowly Changing Dimensions

Customer attributes change over time — address, segment, email. SCD Type 1 overwrites (loses history). SCD Type 2 versions (keeps history with effective dates). Use Type 2 when historical accuracy matters; Type 1 when the current state is all you need. dbt snapshots handle Type 2 automatically.

A common hybrid is Type 2 for attributes that affect historical metrics (customer segment, pricing plan, region) and Type 1 for attributes that should always reflect the current state (phone number, preferred language). Document the decision per column — future analysts will ask why, and a one-line note in the dbt model description saves hours of archaeology. Every Type 2 dimension should have effective_from and effective_to columns plus an is_current flag for simple filtering.
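A minimal dbt snapshot for Type 2 tracking might look like the sketch below, assuming a crm.customers source with a reliable updated_at column (source and column names are hypothetical). dbt adds the dbt_valid_from and dbt_valid_to columns for you:

```sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Each run, dbt compares this result to the snapshot table and
-- versions any row whose updated_at has advanced
select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```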

Step 5: Star Schema by Default

Start with a star schema: a fact table in the middle, dimensions joined around it. Star schemas are fast, simple, and well-understood by every BI tool. Avoid snowflake schemas (dimensions joined to other dimensions) unless you have a very specific reason — they are harder to query and slower.
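In practice, a star-schema query joins the fact table once to each dimension it needs — no dimension-to-dimension hops. A representative example, using hypothetical fct_orders, dim_customers, and dim_date tables:

```sql
-- Revenue by customer segment and quarter: one fact, two dimension joins
select
    c.segment,
    d.quarter,
    sum(f.revenue) as total_revenue
from fct_orders    as f
join dim_customers as c on f.customer_key = c.customer_key
join dim_date      as d on f.date_key     = d.date_key
group by c.segment, d.quarter;
```

This flat join pattern is exactly what BI tools auto-generate, which is why stars stay fast and predictable as the model grows.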

For related topics, see our guides on how to design a data warehouse and what data modeling is.

Step 6: Document Everything

Every table and column gets a description. Every metric definition lives in dbt YAML or a semantic layer. Every join is documented. A new analyst should be able to answer "what does this column mean" without asking. dbt docs generate this for free if you fill in the descriptions.

Data Workers catalog agents auto-populate the technical metadata (schema, lineage, samples) so humans only write the business context that tooling cannot infer. Book a demo to see catalog automation.

Tools You'll Need

A typical modern modeling stack is dbt (or SQLMesh) for transformation, a cloud warehouse for storage, git for version control, and a catalog for discovery. Add a semantic layer like dbt Semantic Layer or Cube once you have 10+ curated models that need consistent metric definitions. Testing tools (dbt tests, Great Expectations, Soda) enforce grain, uniqueness, and referential integrity. Lineage tools (dbt docs, Atlan, OpenMetadata) let consumers trace dependencies. None of this stack is optional for modeling at scale.

Common Mistakes

The single worst modeling mistake is mixing grains in one fact table — some rows are orders, some are line items, some are shipments. Queries against such a table always double-count something. Second worst: over-normalizing dimensions into snowflake schemas when a star would do. Snowflake schemas are harder to query, slower, and harder for BI tools to auto-join. Third: using natural keys (email, customer_name) instead of surrogate keys. When natural keys change — and they always do — every downstream join breaks. Fourth: building models without talking to the business team first. You end up modeling a concept the team does not recognize.

Validation Checklist

Before declaring a model done, verify: grain is documented and enforced with a uniqueness test, every foreign key has a not-null and referential integrity test, every column has a description, SCD Type 2 snapshots exist for slowly-changing attributes you care about, the model is exposed via the semantic layer or marked as an intermediate, and the business owner has reviewed the model for correctness. Every dbt model should pass this checklist before it ships to production.
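The referential-integrity item on that checklist can be spot-checked with an orphaned-key query — dbt's relationships test generates something equivalent. Table and column names below are illustrative:

```sql
-- Returns fact rows whose customer_key has no matching dimension row;
-- an empty result means referential integrity holds
select
    f.order_id,
    f.customer_key
from fct_orders f
left join dim_customers c
    on f.customer_key = c.customer_key
where c.customer_key is null;
```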

Data modeling is understanding the business process and designing facts + dimensions at the right grain. Use star schemas, handle SCDs carefully, and document everything. The models that age well are the ones built with discipline up front, not patched in later.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
