
MCP Server Data Dictionary Exposure

Written by 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


A data dictionary MCP server exposes curated table and column descriptions to agents through a simple lookup tool, so agents can cite the approved definitions instead of guessing. This is often the highest-leverage MCP integration a data team can ship, because it turns existing documentation into live agent context.

Most companies have a data dictionary somewhere — in dbt docs, Confluence, Notion, or a spreadsheet. The problem is that the dictionary is invisible to agents by default, so agents invent their own definitions. Exposing the dictionary through MCP fixes this and takes a few hours of work. This guide covers the patterns.

Why the Data Dictionary Is High-Leverage

Agents hallucinate on data questions mostly because they lack context. They see a table name like fct_orders and infer what the columns mean, often incorrectly. A data dictionary gives them the approved definitions: order_total is the gross order amount in USD before tax and discounts, measured at checkout. That single sentence can turn a confidently wrong agent into a confidently correct one.

The reason this is high-leverage is that the dictionary already exists. Someone on the data team has written it for analysts. MCP just makes it visible to the agent. No new content creation, no ongoing maintenance — just a thin server wrapping an existing source.

Where the Dictionary Lives

Common sources include dbt model YAML files, dbt docs generated sites, Confluence pages, Notion databases, shared Google Sheets, or inline table comments in the warehouse. The MCP server needs to pick one source of truth and wrap it. For most teams that source is dbt YAML or the generated dbt docs manifest.

  • dbt YAML — model and column descriptions
  • dbt docs JSON — generated manifest.json
  • Confluence / Notion — via REST API
  • Warehouse COMMENT — native SQL comments
  • Google Sheets — via Sheets API
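
If dbt is the source of truth, the dictionary can be flattened straight out of the docs artifact. Here is a minimal sketch that reads a dbt manifest.json and builds an entity-to-description map; the manifest layout (a "nodes" dict keyed by unique ID, with per-model "description" and "columns" entries) matches dbt's artifact format, but the function name and the table.column key convention are illustrative choices.

```python
import json

def load_dictionary(manifest_path: str) -> dict[str, str]:
    """Flatten a dbt manifest into {entity: description} entries.

    Keys are model names ("fct_orders") and qualified columns
    ("fct_orders.order_total") -- a hypothetical naming convention.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)

    entries: dict[str, str] = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, etc.
        name = node["name"]
        if node.get("description"):
            entries[name] = node["description"]
        for col, meta in node.get("columns", {}).items():
            if meta.get("description"):
                entries[f"{name}.{col}"] = meta["description"]
    return entries
```

Re-running this loader on every dbt build keeps the server's dictionary in lockstep with the YAML without any separate sync job.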

Core MCP Tool

The tool can be as simple as lookup_definition(entity: string), where entity is a table or column name. The server returns the description, examples, owner, and source (e.g., defined in dbt, last updated Mar 2026). For fuzzy matching, add a search_glossary tool that accepts a natural-language phrase and returns the best matches.

Source            | Extraction Method           | Freshness
dbt YAML          | Read schema.yml files       | On every dbt run
dbt docs manifest | Parse manifest.json         | Daily rebuild
Confluence        | REST API                    | Real-time
Notion            | Pages API                   | Real-time
Warehouse COMMENT | information_schema.columns  | On deploy
Sheets            | Sheets API                  | Real-time

Ranking and Deduplication

If you expose multiple sources, agents will sometimes see conflicting definitions. Pick a priority order (dbt YAML > warehouse COMMENT > Confluence > Sheets) and have the MCP server return only the winning definition. Log the alternatives in the response metadata so analysts can spot drift without confusing the agent.
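
A minimal sketch of that resolution step, assuming each candidate carries a "source" label matching a fixed priority list (the labels themselves are illustrative):

```python
# Highest priority first, mirroring dbt YAML > COMMENT > Confluence > Sheets.
PRIORITY = ["dbt_yaml", "warehouse_comment", "confluence", "sheets"]

def resolve(candidates: list[dict]) -> dict:
    """Return the winning definition; keep the losers as metadata.

    The agent sees only the winner's fields; analysts can inspect
    the "alternatives" list to spot definition drift across sources.
    """
    ranked = sorted(candidates, key=lambda c: PRIORITY.index(c["source"]))
    winner = dict(ranked[0])
    winner["alternatives"] = [
        {"source": c["source"], "description": c["description"]}
        for c in ranked[1:]
    ]
    return winner
```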

Author Attribution

Include the author or owner in every response so the agent can cite the source. That turns a plain definition into a trust-building artifact: order_total is defined by Finance Data Team (Jane Smith) as the gross amount before tax and discounts — source: dbt docs, last updated Mar 2026. Users trust the answer more when they can see who wrote it.

Freshness Signals

Stale definitions are worse than no definitions because they give the agent false confidence. Include a last_updated field in every response and mark entries older than 6 months as possibly stale. The agent can then hedge its answer and suggest the user confirm with the owner.

Data Workers Approach

Data Workers' catalog agent ships with a data dictionary tool that ingests from dbt, Confluence, Notion, and warehouse COMMENTs simultaneously, deduplicates by priority, and surfaces author plus freshness signals. See AI for data infrastructure for the full agent stack or read MCP server business glossary exposure for the glossary-focused variant.

To see a data dictionary MCP server powering an agent with curated definitions, book a demo. We will walk through source ingestion, deduplication, and freshness.

One common mistake is exposing the dictionary without versioning. Definitions change — a metric gets refined, a field gets split, a table gets deprecated — and agents that cached old definitions produce wrong answers. The MCP server should either fetch live every time (simpler) or expose the version number with every response so the agent can detect staleness. Never cache without a version tag.
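
One cheap way to get a version tag without maintaining a version counter is to derive it from the content itself, so any change to the definition changes the tag. A sketch of that approach (the hash-as-version scheme is an assumption, not a prescribed MCP convention):

```python
import hashlib
import json

def versioned_response(entity: str, entry: dict) -> dict:
    """Attach a content-derived version tag to a definition response.

    Identical entries always hash to the same tag, so an agent holding
    a cached definition can detect staleness by comparing tags.
    """
    digest = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()[:12]
    return {"entity": entity, "version": digest, **entry}
```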

Another consideration is the tradeoff between breadth and depth. A dictionary that covers 100% of tables with one-line descriptions is less useful than one that covers the top 20% with rich definitions, examples, and owner contact info. Encourage the data team to prioritize depth on the most-used tables and accept that the long tail can stay thin. The agent's search should weight by depth so rich entries surface first.
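
Depth weighting can be as simple as a scoring function over whatever richness signals the entries carry. A sketch, assuming hypothetical "description", "examples", and "owner" fields and arbitrary weights:

```python
def depth_score(entry: dict) -> int:
    """Score an entry's richness so deeper definitions rank first.

    Weights are illustrative: longer descriptions, worked examples,
    and a named owner all push an entry up the search results.
    """
    score = len(entry.get("description", ""))
    score += 50 * len(entry.get("examples", []))
    score += 25 if entry.get("owner") else 0
    return score

def rank_results(matches: list[dict]) -> list[dict]:
    """Order search matches richest-first."""
    return sorted(matches, key=depth_score, reverse=True)
```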

Finally, measure the hit rate. Log which dictionary lookups return a definition versus which return nothing, and use the miss list to prioritize new entries. Over time this feedback loop turns a static dictionary into a living resource driven by actual agent usage. Teams that do this find the dictionary filling in exactly the tables agents (and their users) care most about.
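
The bookkeeping for that feedback loop fits in a few lines. A minimal sketch with a hypothetical metrics class (in production this would write to a log or metrics store rather than in-memory counters):

```python
from collections import Counter

class DictionaryMetrics:
    """Track lookup hits and misses to prioritize new dictionary entries."""

    def __init__(self) -> None:
        self.hits: Counter = Counter()
        self.misses: Counter = Counter()

    def record(self, entity: str, found: bool) -> None:
        """Call after every lookup with whether a definition was returned."""
        (self.hits if found else self.misses)[entity] += 1

    def top_misses(self, n: int = 10) -> list[tuple[str, int]]:
        """Entities agents asked about most that have no definition yet."""
        return self.misses.most_common(n)
```

The top_misses list becomes the data team's backlog: the entries agents wanted most, written first.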

Exposing the data dictionary through MCP is the single highest-leverage integration most data teams can ship. Existing docs become live agent context, hallucinations drop, and the data team gets leverage on documentation they already wrote.

