Guide · 5 min read

3 Layer Context System For Data


Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


A three-layer context system for data agents separates schema (what tables exist), semantics (what they mean), and signals (how humans use them). Retrieving each layer independently produces tighter, more accurate context than stuffing everything into one embedding index.

Most teams build a single embedding index of their schema and call it context. It works for demos and fails for production. The fix is to split context into three layers with different retrieval strategies for each. This guide explains the pattern and when to use it. Compare to 6-layer context system for data and AI for data infrastructure.

The Three Layers

Layer one is schema: table names, column names, data types, foreign keys. It is structured, authoritative, and changes every time the warehouse changes. Layer two is semantics: descriptions, business definitions, glossary entries, tags. It is unstructured but canonical, owned by humans, and changes slowly. Layer three is signals: query logs, dashboard references, canonicality scores, corrections log. It is continuously updated from actual usage.

  • Layer 1 (Schema) — tables, columns, types, keys — from warehouse
  • Layer 2 (Semantics) — descriptions, glossary, tags — from humans plus LLM enrichment
  • Layer 3 (Signals) — query logs, dashboards, canonicality, corrections — from usage
  • Each layer retrieved independently
  • Each layer has its own update cadence
  • Each layer has its own storage
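The split above can be sketched as three separate record types, one per layer. This is an illustrative data model, not Data Workers' actual schema; the class and field names (`SchemaEntry`, `SemanticEntry`, `SignalEntry`, `canonicality`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class SchemaEntry:
    """Layer 1: authoritative structure, synced from the warehouse."""
    table: str
    columns: dict[str, str]  # column name -> data type
    foreign_keys: list[tuple[str, str]] = field(default_factory=list)

@dataclass
class SemanticEntry:
    """Layer 2: canonical meaning, owned by humans (plus LLM enrichment)."""
    table: str
    description: str
    glossary_terms: list[str] = field(default_factory=list)

@dataclass
class SignalEntry:
    """Layer 3: behavioral evidence, derived continuously from usage."""
    table: str
    query_count_30d: int
    dashboard_refs: int
    canonicality: float  # 0.0 (deprecated) .. 1.0 (canonical)

# One table described three ways, each updated on its own cadence:
orders_schema = SchemaEntry("orders", {"id": "bigint", "amount": "numeric"})
orders_sem = SemanticEntry("orders", "One row per customer order, net of refunds.")
orders_sig = SignalEntry("orders", query_count_30d=1200, dashboard_refs=14,
                         canonicality=0.9)
```

Keeping these as distinct types makes the independent retrieval and independent update cadence explicit: nothing forces the three records to change together.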

Why Three Layers Beat One

A single embedding index conflates authoritative signals (schema) with opinion-based signals (descriptions) with behavioral signals (query logs). The conflation means ranking becomes fuzzy — a table that is popular but deprecated outranks a canonical but obscure one. Splitting the layers lets each rank by its own metric before combining.

The combination is where the power comes from. Schema gives precise recall (I need this exact table). Semantics gives interpretation (this table means revenue net of refunds). Signals give trust (humans actually query this table). Together they produce retrieval that no single index can match.

Retrieval Strategy Per Layer

Schema uses exact-match plus structural similarity. Semantics uses embeddings plus glossary lookups. Signals use scoring functions over query logs and usage metrics. Each retrieval produces a shortlist, and the agent merges the shortlists with weighted re-ranking.
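The merge step can be sketched as a weighted sum over per-layer shortlists. This is a minimal illustration, assuming each layer's scores are already normalized to [0, 1]; the weights and table names are made up for the example.

```python
def merge_shortlists(shortlists: dict[str, dict[str, float]],
                     weights: dict[str, float]) -> list[tuple[str, float]]:
    """shortlists: layer name -> {table: score}. Returns tables ranked
    by the weighted sum of their per-layer scores."""
    combined: dict[str, float] = {}
    for layer, scores in shortlists.items():
        w = weights.get(layer, 0.0)
        for table, score in scores.items():
            combined[table] = combined.get(table, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranked = merge_shortlists(
    {
        "schema":    {"orders": 0.9, "orders_v1_deprecated": 0.9},
        "semantics": {"orders": 0.8},
        "signals":   {"orders": 0.95, "orders_v1_deprecated": 0.1},
    },
    weights={"schema": 0.3, "semantics": 0.3, "signals": 0.4},
)
```

Note how the signals layer breaks the tie: both tables match the schema query equally, but the deprecated one carries almost no usage evidence, so the canonical table ranks first.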

The weights depend on the question. A user asking "what is our revenue definition" leans heavily on semantics. A user asking "which table has the most recent orders" leans heavily on signals. A user asking about joins leans heavily on schema. The agent infers the right weighting from the question and adjusts.
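As a toy illustration of question-dependent weighting, here is a keyword heuristic. A production system would learn these weights from feedback rather than hard-code rules; the keywords and weight values below are assumptions made for the sketch.

```python
def infer_weights(question: str) -> dict[str, float]:
    """Pick layer weights from the shape of the question (toy heuristic)."""
    q = question.lower()
    if any(k in q for k in ("definition", "mean", "glossary")):
        return {"schema": 0.2, "semantics": 0.6, "signals": 0.2}  # meaning question
    if any(k in q for k in ("most recent", "popular", "actually use")):
        return {"schema": 0.2, "semantics": 0.2, "signals": 0.6}  # trust question
    if any(k in q for k in ("join", "foreign key", "column type")):
        return {"schema": 0.6, "semantics": 0.2, "signals": 0.2}  # structure question
    return {"schema": 1 / 3, "semantics": 1 / 3, "signals": 1 / 3}  # no signal: even split
```

Replacing the keyword rules with weights learned from the corrections log is exactly the "learned weighting" the Common Mistakes section argues for.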

Update Cadence

Schema updates on every warehouse change (seconds to minutes). Semantics updates when humans or LLM enrichment ships (hours to days). Signals update continuously as queries run (real-time). Each layer has its own pipeline with its own latency budget, and the layers do not block each other.
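The three pipelines and their latency budgets can be written down as a small config. The trigger names and budget numbers here are illustrative assumptions, not a real Data Workers configuration.

```python
# Per-layer pipeline config: each layer has its own trigger and latency budget,
# and none of them blocks the others.
LAYER_PIPELINES = {
    "schema":    {"trigger": "warehouse_ddl_event",      "latency_budget_s": 60},
    "semantics": {"trigger": "human_edit_or_enrichment", "latency_budget_s": 86_400},
    "signals":   {"trigger": "query_log_stream",         "latency_budget_s": 5},
}
```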

When Three Layers Are Enough

Three layers are enough for most teams. The complexity of more layers does not pay off until you have thousands of tables, multiple domains, and a mature glossary. Start with three, measure accuracy, and split further only if you hit a ceiling. See the six-layer variant when three is not enough.

Common Mistakes

The biggest mistake is a single index that conflates everything. The second is treating schema as semantics by dumping column descriptions into the same embedding space. The third is ignoring signals entirely, which means canonicality never enters retrieval. The fourth is manually tuning layer weights instead of learning them from feedback.

Data Workers ships a three-layer context system out of the box with independent retrieval per layer and learned weighting from corrections feedback. Teams go from 40 percent accuracy to 80 percent by switching from a single index to the three-layer pattern. To see it run, book a demo.

Keeping Layers Consistent

The three layers update at different cadences, which is both a feature and a source of complexity. Schema is driven by warehouse changes (event-driven, sub-minute). Semantics is driven by human edits and LLM enrichment (hours to days). Signals are driven by query logs (real-time stream). The layers have to stay consistent even when they update on different timelines.

Consistency is maintained by versioning. Each layer snapshot has a version, and the agent queries them together for the same version whenever possible. If a new schema version is available but new semantics is not, the agent uses the new schema with the old semantics until the semantics catches up. No partial-update states are exposed to the agent.
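The version-pinning rule can be sketched in a few lines: each layer publishes immutable snapshot versions, and the agent pins the newest complete snapshot per layer. The function name and input shape are assumptions for illustration.

```python
def pin_versions(published: dict[str, list[int]]) -> dict[str, int]:
    """Pin each layer to its newest published snapshot version.

    If semantics lags schema, the newer schema is paired with the older
    semantics until it catches up; partially written snapshots are never
    in `published`, so no partial-update state is exposed to the agent.
    The pinned set is recorded alongside each answer, so any question can
    later be replayed against exactly these versions.
    """
    return {layer: max(versions) for layer, versions in published.items()}

# Schema v3 shipped, but semantics is still at v2:
pinned = pin_versions({"schema": [1, 2, 3], "semantics": [1, 2], "signals": [1, 2, 3]})
```

Because snapshots are immutable, logging `pinned` next to each answer is all the state needed for deterministic replay and clean rollback.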

This versioning discipline makes rollback clean and debugging deterministic. When an answer is wrong, you can replay the same question against the same layer versions and reproduce the bug. Data Workers treats every layer version as an immutable snapshot so replay is always possible.

Choosing Storage for Each Layer

Schema lives naturally in a relational catalog because the underlying data is structured. Postgres or a dedicated catalog like OpenMetadata both work. Semantics lives in a vector database because most of the content is unstructured text that needs embedding-based retrieval. Signals live in a time-series or analytics store because the queries are over historical usage patterns.

Mixing storage types is fine and often optimal. Each store is small, each has its own access pattern, and each is sized for its workload. The cross-store queries happen at the retrieval layer, where the orchestrator calls each store in parallel and merges results. The orchestrator is the glue that makes heterogeneous storage feel like one system.
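The parallel fan-out described above can be sketched with a thread pool. The three store functions below are stand-ins for real Postgres / vector DB / analytics clients; their names and return shapes are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the three heterogeneous stores (illustrative only).
def query_schema_store(question: str) -> dict[str, float]:
    return {"orders": 0.9}       # relational catalog: exact/structural match

def query_vector_store(question: str) -> dict[str, float]:
    return {"orders": 0.8}       # vector DB: embedding similarity

def query_signal_store(question: str) -> dict[str, float]:
    return {"orders": 0.95}      # analytics store: usage-based score

def retrieve(question: str) -> dict[str, dict[str, float]]:
    """Fan out to every store in parallel, then collect one shortlist per layer."""
    stores = {
        "schema": query_schema_store,
        "semantics": query_vector_store,
        "signals": query_signal_store,
    }
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        futures = {layer: pool.submit(fn, question) for layer, fn in stores.items()}
        return {layer: f.result() for layer, f in futures.items()}

shortlists = retrieve("which table holds orders?")
```

The orchestrator is the only place that knows about all three stores; each store stays small and specialized, and the merge step downstream sees a uniform `{layer: shortlist}` shape regardless of what backs each layer.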

Data Workers lets teams plug in their preferred stores or use the defaults. Most teams start with the defaults and replace individual stores only when they hit scale limits. The architecture supports both without forcing a choice upfront.

Teams that adopt the three-layer model early avoid the painful refactor that teams doing flat context eventually face. A flat context store works for the first 50 tables and 100 glossary entries, then starts breaking down as retrieval gets noisy and accuracy drops. Splitting into layers is surgery on a running system, which means downtime, regressions, and retraining. Starting with layers costs a day of extra setup and saves months of refactoring later. The architecture decision compounds with every table, every glossary entry, and every correction the system ingests.

Three-layer context is the starting point for production-grade data agents. Split schema, semantics, and signals; retrieve each independently; merge with learned weights, and accuracy jumps.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
