Data Vault Modeling Guide: Hubs, Links, Satellites

Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.

Data Vault is a warehouse modeling methodology designed for auditable integration of many source systems. It uses three core constructs — hubs for business keys, links for relationships, and satellites for descriptive attributes with full history. The result is an insert-only, highly parallel, audit-friendly data layer that can absorb new sources without refactoring.

This guide walks through the hub/link/satellite pattern, the Raw Vault vs Business Vault split, the tooling that makes Data Vault practical in 2026, and the specific organizational conditions under which the methodology actually pays off instead of adding bureaucracy.

Why Data Vault Exists

Dan Linstedt created Data Vault in the early 2000s for Lockheed Martin, where Kimball's dimensional modeling could not handle the integration of dozens of source systems under audit pressure. Every change to a Kimball star forced refactors across many tables; Data Vault's insert-only model absorbed change without touching existing tables.

Two decades later, Data Vault is the standard for enterprise data warehouses in banking, insurance, pharma, and government — anywhere regulation demands full traceability and source systems change faster than a central team can keep up. The methodology's durability comes from its decoupling of ingestion (Raw Vault) from modeling (marts), which lets each evolve on its own schedule.

The Three Constructs

| Construct | Stores | Key Columns |
| --- | --- | --- |
| Hub | Distinct business keys | hub_key, business_key, load_date, record_source |
| Link | Relationships between hubs | link_key, hub_keys, load_date, record_source |
| Satellite | Descriptive attributes + history | hub_key, load_date, hash_diff, attributes |

Every real-world entity (customer, order, product) becomes a hub. Every relationship (customer places order) becomes a link. Every set of attributes that can change (customer address, order status) becomes a satellite. New sources add new hubs, links, and satellites — existing ones are never modified. This insert-only discipline is the core of Data Vault's durability.
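To make the column layout concrete, here is a minimal sketch of the three row shapes in Python. The entity names (customer, order) and attribute columns are illustrative examples, not a prescribed schema; column names follow the table above.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative row shapes for the three core constructs.

@dataclass(frozen=True)
class HubCustomer:
    hub_key: str          # hash of the business key
    business_key: str     # e.g. customer_id from the source system
    load_date: date
    record_source: str

@dataclass(frozen=True)
class LinkCustomerOrder:
    link_key: str         # hash of the combined hub keys
    hub_customer_key: str
    hub_order_key: str
    load_date: date
    record_source: str

@dataclass(frozen=True)
class SatCustomerDetails:
    hub_key: str          # parent hub this satellite describes
    load_date: date
    hash_diff: str        # hash of all descriptive attributes
    name: str             # descriptive attributes vary per satellite
    address: str
```

Note that no construct carries an "updated_at" or mutable state: new facts arrive as new rows, which is what makes the model insert-only.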

Raw Vault vs Business Vault

The Raw Vault captures sources as-is with no transformations — every field lands in a satellite unchanged. The Business Vault sits on top and contains derived attributes, computed metrics, and business-key rationalization (e.g., matching customers across CRM and billing). The split gives you an unambiguous audit trail in Raw while allowing business logic in Business.

Downstream of both, Kimball-style marts serve BI workloads. This three-layer pattern — Raw Vault → Business Vault → marts — is the dominant enterprise architecture in 2026, combining Data Vault's flexibility with Kimball's query speed for analysts.

Hash Keys and Hash Diffs

Modern Data Vault uses hash keys (e.g., an SHA-1 hash of the business key) instead of sequence-based surrogate keys. Hashes can be computed in parallel without coordination, so loads scale horizontally across warehouses. The hash_diff is a hash of all of a satellite's descriptive attributes and is used to detect changes: if a new row's hash_diff matches the latest stored version, the insert is skipped.
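A minimal sketch of both computations, assuming a common convention of trimming and upper-casing business keys and joining values with a fixed delimiter before hashing (exact normalization rules vary by implementation and tool):

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """SHA-1 over normalized business keys joined by a fixed delimiter."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def hash_diff(attributes: dict) -> str:
    """SHA-1 over all descriptive attributes in a deterministic column order."""
    payload = "||".join(str(attributes[k]) for k in sorted(attributes))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

# The same business key always yields the same hub key, on any node,
# with no central sequence generator involved.
assert hash_key("cust-42") == hash_key(" CUST-42 ")

# A changed attribute changes the hash_diff, signalling a new satellite row.
v1 = hash_diff({"name": "Ada", "city": "London"})
v2 = hash_diff({"name": "Ada", "city": "Paris"})
assert v1 != v2
```

Because the key is a pure function of the business key, two loaders processing different files can derive identical hub keys independently, which is the coordination-free property the next paragraph relies on.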

The parallelism matters at scale. Sequence-based surrogate keys require a central generator, which becomes a bottleneck at high volume. Hash keys have no coordination cost, which is why Data Vault loads scale more linearly than most warehouse methodologies.

The Loading Pattern

  • Stage the source — land raw data in a staging table
  • Load hubs — insert new business keys only
  • Load links — insert new relationships only
  • Load satellites — insert rows where hash_diff changed
  • All loads are insert-only — highly parallel, never blocking
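The steps above can be sketched in Python, using plain dicts as in-memory stand-ins for the hub and satellite tables (a real implementation would be generated SQL; column names like customer_id are illustrative):

```python
import hashlib
from datetime import date

def sha1(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def load_hub(hub: dict, staged: list, source: str, day: date) -> int:
    """Step 2: insert business keys the hub has not seen; never update."""
    inserted = 0
    for row in staged:
        hk = sha1(row["customer_id"].strip().upper())
        if hk not in hub:
            hub[hk] = {"business_key": row["customer_id"],
                       "load_date": day, "record_source": source}
            inserted += 1
    return inserted

def load_satellite(sat: dict, staged: list, source: str, day: date) -> int:
    """Step 4: insert a new version only when the hash_diff changed."""
    inserted = 0
    for row in staged:
        hk = sha1(row["customer_id"].strip().upper())
        hd = sha1("||".join(str(row[c]) for c in ("name", "city")))
        versions = sat.setdefault(hk, [])
        if not versions or versions[-1]["hash_diff"] != hd:
            versions.append({"load_date": day, "hash_diff": hd,
                             "name": row["name"], "city": row["city"]})
            inserted += 1
    return inserted
```

Re-running the same batch inserts nothing, so loads are idempotent as well as insert-only; since no step updates or deletes existing rows, hub, link, and satellite loads for different sources can run concurrently without blocking each other.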

Tooling in 2026

Writing raw Data Vault SQL by hand is painful. Tools like AutomateDV, dbtvault, Datavault4dbt, and VaultSpeed generate the DDL and load procedures from metadata specs. You declare 'hub_customer uses business_key customer_id' and the tool writes the SQL. See data vault vs kimball for the broader comparison.

When Data Vault Is Overkill

Small teams with few sources do not need Data Vault. The overhead of hubs, links, and satellites is only worth it once you have 10+ source systems, meaningful audit pressure, or frequent schema changes that Kimball cannot absorb. For a startup analytics stack, Kimball is still faster to ship and easier to explain to stakeholders.

A useful heuristic: if you are spending more than half your sprint capacity on refactoring existing tables because a source schema changed, Data Vault is probably worth evaluating. If your central team keeps up with source changes comfortably, stay on Kimball.

Point-in-Time Queries

One of Data Vault's underrated features is point-in-time (PIT) tables — pre-joined views that materialize the state of a hub as of a specific date. These let analysts query history without manually writing temporal joins across satellites. AutomateDV and dbtvault both ship PIT table generators. If you need 'what did we know about customer X on date Y' queries regularly, PIT tables are essential.
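The lookup a PIT table materializes is simple to state: for a given hub key and snapshot date, find the latest satellite version loaded on or before that date. A minimal sketch over in-memory rows (a PIT table precomputes this for every hub key and snapshot date so analysts never write the temporal join by hand):

```python
from datetime import date

def as_of(satellite_rows: list, as_of_date: date):
    """Return the satellite version that was current on as_of_date, or None.

    satellite_rows: all versions for one hub_key, each with a 'load_date'.
    """
    eligible = [r for r in satellite_rows if r["load_date"] <= as_of_date]
    return max(eligible, key=lambda r: r["load_date"]) if eligible else None

history = [
    {"load_date": date(2024, 1, 5), "city": "London"},
    {"load_date": date(2024, 6, 1), "city": "Paris"},
]
assert as_of(history, date(2024, 3, 1))["city"] == "London"   # before the move
assert as_of(history, date(2024, 7, 1))["city"] == "Paris"    # after the move
assert as_of(history, date(2023, 12, 1)) is None              # not yet known
```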

Bridge tables are the companion pattern for many-to-many relationships across time. A bridge table records which hub instances were related during which time window, which makes it easy to answer questions like 'which accounts reported to which manager during Q3 2024'. Together, PIT and bridge tables turn Data Vault from an integration layer into a usable analytics surface.
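The manager question above reduces to an interval-overlap test against the bridge rows. A sketch under the assumption that each bridge row carries valid_from/valid_to dates and the two hub keys it connects (column names are illustrative):

```python
from datetime import date

def related_during(bridge: list, start: date, end: date) -> set:
    """Hub-key pairs whose validity window overlaps the [start, end] interval."""
    return {(row["account_key"], row["manager_key"])
            for row in bridge
            if row["valid_from"] <= end and row["valid_to"] >= start}

# Hypothetical bridge rows linking account and manager hubs over time.
bridge = [
    {"account_key": "a1", "manager_key": "m1",
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 8, 15)},
    {"account_key": "a1", "manager_key": "m2",
     "valid_from": date(2024, 8, 16), "valid_to": date(9999, 12, 31)},
]

# 'Which accounts reported to which manager during Q3 2024?'
q3 = related_during(bridge, date(2024, 7, 1), date(2024, 9, 30))
assert q3 == {("a1", "m1"), ("a1", "m2")}   # both windows overlap Q3
```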

Rollout Strategy

Do not try to refactor your whole warehouse into Data Vault in one quarter. Start with one domain (finance is a common first target because audit pressure is highest there), build the Raw Vault, validate the loads, and layer a Business Vault and Kimball marts on top. Expand to additional domains once the first is stable. Teams that try to boil the ocean typically stall after six months because the cognitive load of learning the pattern across many domains simultaneously is overwhelming.

Proof-of-value checkpoints matter. After domain one, measure audit query time, time-to-add-new-source, and analyst satisfaction. If any of those have not improved, stop and reassess before expanding. Sunk-cost bias is the Data Vault adoption killer — teams that invested six months in the pattern feel obligated to continue even when it is not working, which wastes another six months.

Agent-Managed Vaults

Data Workers' migration and schema agents can auto-generate new satellites when source schemas evolve, and quality agents monitor hash_diff stability. See autonomous data engineering or book a demo.

Data Vault is the enterprise answer to auditable integration at scale. Use hubs, links, and satellites to absorb source change without refactoring, split Raw from Business, and let tools like dbtvault generate the boilerplate. It is heavy, but worth it at enterprise scale where the alternative is endless refactoring.
