Guide · 4 min read

How to Do Data Modeling: Kimball for the Modern Stack

Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.

To do data modeling: understand the business process, identify facts and dimensions, design a star or snowflake schema, pick grain and keys carefully, and document everything. The goal is a model that analysts can query in 30 seconds and that survives schema changes without breaking dashboards.

Data modeling is the craft that separates a warehouse analysts trust from one nobody queries twice. This guide walks through Kimball-style dimensional modeling adapted for modern cloud warehouses and dbt.

The best investment a data team can make is modeling discipline. Automation, observability, semantic layers, and AI clients all sit on top of the underlying data model — a weak model means every upper layer compounds the weakness. A strong model makes every tool work better. Treat data modeling as software design: iterate, review, test, and refactor. Models that last five years are the ones that were designed with the same care you would give a public API.

Step 1: Understand the Business Process

Every model starts with a business process: an order being placed, a subscription renewed, a support ticket created. Talk to the team that owns the process. Understand the events, actors, and frequency. Everything downstream flows from this. A model built without business context is a guess.

Write down the process in plain English before you touch SQL. What event are we tracking? Who cares about it? What decisions does it drive?

Step 2: Identify Facts and Dimensions

Facts are events with numeric measures — order_amount, session_duration, click_count. Dimensions describe the context — customer, product, date, campaign. A fact table has many foreign keys to dimensions and one or more measures. Dimensions are relatively small and slowly changing.

The distinction sounds obvious but trips up many analysts the first time. A useful heuristic: if the column answers a "how much" or "how many" question, it is a measure. If it answers a "who, what, when, where" question, it is a dimension. Revenue is a measure. The customer who paid it is a dimension. The date is a dimension. The product they bought is a dimension. Measures go in fact tables; dimensions go in dimension tables. Get this separation right and the rest of modeling becomes mechanical.

Entity            Type                      Example columns
fct_orders        Fact                      order_id, customer_id, revenue
dim_customers     Dimension                 customer_id, name, segment, country
dim_products      Dimension                 product_id, sku, category
dim_date          Dimension                 date_key, day, month, quarter
fct_order_lines   Fact (grain: line item)   line_id, order_id, product_id
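As a sketch, the fact/dimension separation might look like the following DDL. Table and column names here are illustrative, not from any real schema:

```sql
-- Dimension: one row per customer, descriptive attributes only
create table dim_customers (
    customer_key   integer primary key,   -- surrogate key
    customer_id    varchar,               -- natural key from the source system
    name           varchar,
    segment        varchar,
    country        varchar
);

-- Fact: one row per order, foreign keys to dimensions plus measures
create table fct_orders (
    order_id       varchar primary key,
    customer_key   integer references dim_customers (customer_key),
    date_key       integer,               -- references dim_date
    revenue        numeric(12, 2),        -- measure: answers "how much"
    item_count     integer                -- measure: answers "how many"
);
```

Note that the fact table carries no descriptive text at all — everything a human would filter or group by lives in the dimensions.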

Step 3: Choose the Grain

Grain is the atomic level of a fact table: one row per order, one row per order line item, one row per page view. Choose the most granular grain your consumers need — you can always aggregate up, but you cannot disaggregate. Mixing grains in a single fact table is the number one modeling mistake.

  • One row per transaction — most common grain
  • One row per line item — for order analysis
  • One row per day per user — for engagement
  • One row per event — for clickstream
  • Snapshot grain — for slowly changing state
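A grain declaration is only useful if it is enforced. A minimal enforcement check — assuming a one-row-per-order grain on a hypothetical fct_orders table — is a query that returns zero rows when the grain holds (dbt's built-in unique test compiles to essentially this):

```sql
-- Returns rows only when the declared grain (one row per order_id) is violated
select
    order_id,
    count(*) as row_count
from fct_orders
group by order_id
having count(*) > 1;
```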

Step 4: Handle Slowly Changing Dimensions

Customer attributes change over time — address, segment, email. SCD Type 1 overwrites (loses history). SCD Type 2 versions (keeps history with effective dates). Use Type 2 when historical accuracy matters; Type 1 when the current state is all you need. dbt snapshots handle Type 2 automatically.

A common hybrid is Type 2 for attributes that affect historical metrics (customer segment, pricing plan, region) and Type 1 for attributes that should always reflect the current state (phone number, preferred language). Document the decision per column — future analysts will ask why, and a one-line note in the dbt model description saves hours of archaeology. Every Type 2 dimension should have effective_from and effective_to columns plus an is_current flag for simple filtering.
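A minimal dbt snapshot for Type 2 tracking might look like the sketch below, assuming a crm.customers source with a reliable updated_at column (source and column names are hypothetical). dbt adds the dbt_valid_from and dbt_valid_to columns for you:

```sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Each run, dbt compares this result to the snapshot table and
-- versions any row whose updated_at has advanced
select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```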

Step 5: Star Schema by Default

Start with a star schema: a fact table in the middle, dimensions joined around it. Star schemas are fast, simple, and well-understood by every BI tool. Avoid snowflake schemas (dimensions joined to other dimensions) unless you have a very specific reason — they are harder to query and slower.
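In practice, a star-schema query joins the fact table once to each dimension it needs — no dimension-to-dimension hops. A representative example, using hypothetical fct_orders, dim_customers, and dim_date tables:

```sql
-- Revenue by customer segment and quarter: one fact, two dimension joins
select
    c.segment,
    d.quarter,
    sum(f.revenue) as total_revenue
from fct_orders    as f
join dim_customers as c on f.customer_key = c.customer_key
join dim_date      as d on f.date_key     = d.date_key
group by c.segment, d.quarter;
```

This flat join pattern is exactly what BI tools auto-generate, which is why stars stay fast and predictable as the model grows.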

For related topics, see our guides on how to design a data warehouse and what data modeling is.

Step 6: Document Everything

Every table and column gets a description. Every metric definition lives in dbt YAML or a semantic layer. Every join is documented. A new analyst should be able to answer "what does this column mean" without asking. dbt docs generate this for free if you fill in the descriptions.

Data Workers catalog agents auto-populate the technical metadata (schema, lineage, samples) so humans only write the business context that tooling cannot infer. Book a demo to see catalog automation.

Tools You'll Need

A typical modern modeling stack is dbt (or SQLMesh) for transformation, a cloud warehouse for storage, git for version control, and a catalog for discovery. Add a semantic layer like dbt Semantic Layer or Cube once you have 10+ curated models that need consistent metric definitions. Testing tools (dbt tests, Great Expectations, Soda) enforce grain, uniqueness, and referential integrity. Lineage tools (dbt docs, Atlan, OpenMetadata) let consumers trace dependencies. None of this stack is optional for modeling at scale.

Common Mistakes

The single worst modeling mistake is mixing grains in one fact table — some rows are orders, some are line items, some are shipments. Queries against such a table always double-count something. Second worst: over-normalizing dimensions into snowflake schemas when a star would do. Snowflake schemas are harder to query, slower, and harder for BI tools to auto-join. Third: using natural keys (email, customer_name) instead of surrogate keys. When natural keys change — and they always do — every downstream join breaks. Fourth: building models without talking to the business team first. You end up modeling a concept the team does not recognize.

Validation Checklist

Before declaring a model done, verify: grain is documented and enforced with a uniqueness test, every foreign key has a not-null and referential integrity test, every column has a description, SCD Type 2 snapshots exist for slowly-changing attributes you care about, the model is exposed via the semantic layer or marked as an intermediate, and the business owner has reviewed the model for correctness. Every dbt model should pass this checklist before it ships to production.
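The referential-integrity item on that checklist can be spot-checked with an orphaned-key query — dbt's relationships test generates something equivalent. Table and column names below are illustrative:

```sql
-- Returns fact rows whose customer_key has no matching dimension row;
-- an empty result means referential integrity holds
select
    f.order_id,
    f.customer_key
from fct_orders f
left join dim_customers c
    on f.customer_key = c.customer_key
where c.customer_key is null;
```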

Data modeling is understanding the business process and designing facts + dimensions at the right grain. Use star schemas, handle SCDs carefully, and document everything. The models that age well are the ones built with discipline up front, not patched in later.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
