MCP Server for DataHub Metadata
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
A DataHub MCP server exposes the catalog's GraphQL API to agents so they can search entities, resolve lineage, and read glossary terms through a single MCP endpoint. Connecting it correctly means creating a service user, minting a personal access token with the right privileges, and scoping queries to the subset of the metadata graph the agent actually needs.
DataHub is one of the leading open-source catalogs, and exposing it through MCP lets agents discover data assets the same way humans do through the UI. This guide covers authentication, GraphQL entry points, entity search, lineage walks, and the operational patterns that keep a DataHub MCP server useful in production.
Why Expose DataHub via MCP
Most agent failures on data questions are context failures — the agent does not know which tables exist, what they mean, or how they are connected. DataHub already holds all of that metadata, often curated by a dedicated team. Exposing it via MCP turns that curated knowledge into first-class tool calls the agent can use on every question.
The alternative is baking a schema dump into the prompt, which goes stale within a day and blows up context windows. An MCP server keeps metadata live — every query hits the current graph, and the agent always sees the latest tables, owners, and tags.
Authentication and PATs
DataHub supports personal access tokens (PATs) for service accounts. Create a PAT in the DataHub UI under Settings → Access Tokens, name it mcp-agent, and scope it to the curated tenant. Load the token into the MCP server via environment variable and rotate it every 90 days. Do not use a human PAT or a wildcard token.
- Service PAT — dedicated for the MCP server
- Scoped to tenant — not the root org
- GraphQL endpoint — HTTPS only
- Rate limit — honor DataHub's 100 req/min default
- Fallback to read-only — never mutate metadata
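The checklist above can be sketched in a few lines. This is a minimal example, not a full client: it assumes the PAT lives in the `DATAHUB_TOKEN` environment variable and that the GMS endpoint URL (here a placeholder) is configured separately. DataHub PATs are sent as a Bearer token on each call to the `/api/graphql` endpoint.

```python
import json
import os
import urllib.request

# Placeholder endpoint; in practice read this from config, HTTPS only
DATAHUB_GQL = os.environ.get("DATAHUB_GMS_URL", "https://datahub.example.com/api/graphql")


def auth_headers(token: str) -> dict:
    # DataHub personal access tokens are passed as a Bearer token
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


def graphql(query: str, variables: dict, token: str) -> dict:
    # POST one GraphQL document per MCP tool call; read-only queries only
    body = json.dumps({"query": query, "variables": variables}).encode()
    req = urllib.request.Request(DATAHUB_GQL, data=body, headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Rate limiting and token rotation sit outside this sketch; wrap `graphql` with a limiter honoring the 100 req/min default before shipping.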
Key MCP Tools for DataHub
A useful DataHub MCP server exposes five or six tools that map to DataHub's GraphQL API: searchEntities, getEntity, getLineage, getOwners, getGlossaryTerm, and getDocumentation. Each wraps a GraphQL query and returns a trimmed-down JSON shape the agent can reason about. Avoid exposing raw GraphQL — it is too permissive and too verbose.
| Tool | GraphQL Query | Use |
|---|---|---|
| searchEntities | searchAcrossEntities | Find datasets by keyword |
| getEntity | dataset(urn) | Load full entity record |
| getLineage | lineage(urn, direction) | Walk upstream or downstream |
| getOwners | entity.ownership | Who to contact |
| getGlossaryTerm | glossaryTerm(urn) | Business definitions |
| getDocumentation | dataset.properties.description | Curated docs |
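One way to wire the table above into an MCP server is a small registry that maps each tool name to the GraphQL operation it wraps and the fields kept in the trimmed response. The field lists below are illustrative assumptions, not DataHub's exact schema:

```python
# Hypothetical tool registry: tool name -> wrapped GraphQL operation
# plus the whitelist of fields returned to the agent.
TOOLS = {
    "searchEntities":   {"op": "searchAcrossEntities", "keep": ["urn", "name", "description"]},
    "getEntity":        {"op": "dataset",              "keep": ["urn", "name", "schemaMetadata"]},
    "getLineage":       {"op": "lineage",              "keep": ["urn", "hop"]},
    "getOwners":        {"op": "ownership",            "keep": ["owner", "type"]},
    "getGlossaryTerm":  {"op": "glossaryTerm",         "keep": ["name", "definition"]},
    "getDocumentation": {"op": "dataset",              "keep": ["description"]},
}


def trim(raw: dict, keep: list) -> dict:
    # Drop DataHub's verbose envelope; keep only agent-relevant fields
    return {k: raw[k] for k in keep if k in raw}
```

Whitelisting fields at the registry level is what keeps raw GraphQL out of the agent's reach: the server, not the model, decides what crosses the wire.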
Entity Search Patterns
Agents usually start with a natural-language query and need to find the right dataset. The searchEntities tool should accept a keyword and optional filters (platform, type, owner), then return a ranked list of URNs with titles, descriptions, and match scores. The MCP server should trim irrelevant fields before returning — DataHub responses are verbose and will otherwise bloat the agent's context.
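A sketch of the variable-building side of `searchEntities`, assuming an input shape close to DataHub's `searchAcrossEntities` (the exact field names should be checked against your DataHub version's GraphQL schema):

```python
SEARCH_QUERY = """
query search($input: SearchAcrossEntitiesInput!) {
  searchAcrossEntities(input: $input) {
    searchResults { entity { urn } }
  }
}
"""


def search_variables(keyword: str, platform: str = None, entity_type: str = "DATASET",
                     limit: int = 10) -> dict:
    # Translate the tool's simple arguments into DataHub's filter structure
    filters = []
    if platform:
        filters.append({"field": "platform", "values": [platform]})
    return {"input": {
        "query": keyword,
        "types": [entity_type],
        "start": 0,
        "count": limit,
        "orFilters": [{"and": filters}] if filters else [],
    }}
```

Keeping `limit` small (10 or so) matters more than it looks: each extra result is context the agent has to carry through the rest of the conversation.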
Lineage Walks
Lineage is one of DataHub's strongest features and one of the most useful MCP capabilities. The getLineage tool should accept a URN and a direction (upstream, downstream, both) and return a graph trimmed to a reasonable depth (2-3 hops). Agents can use lineage to answer questions like what depends on this table? without touching the underlying warehouse.
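The depth-trimming logic can be sketched as a breadth-first walk over an already-fetched edge map. The edge-map shape here is an assumption for illustration, not DataHub's response format:

```python
def walk_lineage(edges: dict, root: str, direction: str = "downstream",
                 max_hops: int = 2) -> list:
    # edges: {urn: {"upstream": [urns], "downstream": [urns]}} (assumed shape)
    # BFS outward from root, stopping after max_hops to keep responses small
    seen = {root}
    frontier = [root]
    result = []
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for urn in frontier:
            for neighbor in edges.get(urn, {}).get(direction, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    result.append({"urn": neighbor, "hop": hop})
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return result
```

Returning the hop count with each URN lets the agent distinguish direct dependents from transitive ones when answering "what depends on this table?".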
Governance and PII
DataHub tags entities with glossary terms and tags that often encode PII status. An MCP server should respect those tags — if an entity is marked PII, the agent should either see a redacted description or be blocked from touching it. Wire the tag policy at the MCP server layer so the enforcement is automatic.
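A minimal sketch of that enforcement layer, assuming entities carry a flat list of tag URNs (the specific tag URNs below are placeholders for whatever your organization uses):

```python
# Placeholder tag URNs; substitute your organization's PII tags
PII_TAGS = {"urn:li:tag:pii", "urn:li:tag:sensitive"}


def enforce_tag_policy(entity: dict) -> dict:
    # Runs on every MCP response before it reaches the agent
    tags = set(entity.get("tags", []))
    if tags & PII_TAGS:
        # Redact rather than omit: the agent learns the entity exists,
        # but none of its descriptive content leaks into the context
        return {"urn": entity["urn"], "description": "[redacted: PII-tagged entity]"}
    return entity
```

Because the check runs in the MCP server, no prompt instruction is needed and no model decision is trusted; the policy holds even if the agent is jailbroken.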
Data Workers on DataHub
Data Workers' DataHub connector handles PAT auth, exposes the six core tools above, trims responses, and enforces tag-based policies. The catalog agent can federate across DataHub plus other catalogs, giving the agent a unified metadata plane. See AI for data infrastructure for the full agent stack or compare to MCP server OpenMetadata lineage.
To see a DataHub MCP server powering agent workflows with live metadata, book a demo. We will walk through PAT setup, tool design, and lineage walks.
DataHub's recent work on data contracts is also worth exposing via MCP. Contracts declare expected schemas, quality constraints, and SLAs that the catalog enforces automatically. An MCP tool that reads contracts gives the agent a machine-readable spec for how each table should behave — far more useful than prose documentation. When an agent detects a contract violation, it can flag it in the audit log and surface the broken contract to the owner.
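A contract-violation check can be sketched as a diff between declared and observed schemas. The contract shape below is an assumed simplification for illustration, not DataHub's contract API:

```python
def contract_violations(contract: dict, actual_schema: dict) -> list:
    # contract: {"columns": {"col_name": "TYPE", ...}} (assumed shape)
    # actual_schema: {"col_name": "TYPE", ...} as observed in the warehouse
    violations = []
    for col, expected in contract["columns"].items():
        actual = actual_schema.get(col)
        if actual is None:
            violations.append(f"missing column: {col}")
        elif actual != expected:
            violations.append(f"type mismatch on {col}: expected {expected}, got {actual}")
    return violations
```

An empty list means the contract holds; anything else is what the agent writes to the audit log and surfaces to the owner.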
Another capability DataHub users should expose is the tag and term graph. DataHub models tags and glossary terms as first-class entities with their own relationships, and the graph is searchable via GraphQL. An MCP server that exposes searchByTag and searchByTerm lets the agent find every dataset tagged gdpr or every asset attached to the revenue glossary term. That is vastly more useful than keyword search on table names.
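A `searchByTag` tool can reuse the same search operation with a filter on the tags field instead of a keyword. As before, the exact filter field names are an assumption to verify against your DataHub schema:

```python
TAG_SEARCH_QUERY = """
query byTag($input: SearchAcrossEntitiesInput!) {
  searchAcrossEntities(input: $input) {
    searchResults { entity { urn } }
  }
}
"""


def tag_filter_variables(tag_urn: str, count: int = 25) -> dict:
    # Match on the tag graph, not on table-name keywords: query "*"
    # with a structured filter finds every asset carrying the tag
    return {"input": {
        "query": "*",
        "orFilters": [{"and": [{"field": "tags", "values": [tag_urn]}]}],
        "start": 0,
        "count": count,
    }}
```

`searchByTerm` is the same shape with the filter field pointed at glossary terms rather than tags.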
For teams using DataHub's ingestion framework to pull metadata from dbt, Airflow, and warehouses, the MCP server inherits freshness for free — ingested metadata is usually only a few hours old. Monitor the ingestion jobs and surface the last-ingested timestamp in MCP responses so the agent can reason about staleness. A six-month-old metadata record is worse than none; a one-hour-old record is gold.
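Surfacing staleness can be as simple as attaching an age, in hours, computed from the last-ingested epoch timestamp. A minimal sketch:

```python
from datetime import datetime, timezone


def staleness_hours(last_ingested_ms: int, now: datetime = None) -> float:
    # DataHub-style timestamps are epoch milliseconds; convert and diff
    now = now or datetime.now(timezone.utc)
    then = datetime.fromtimestamp(last_ingested_ms / 1000, tz=timezone.utc)
    return (now - then).total_seconds() / 3600


def annotate_freshness(response: dict, last_ingested_ms: int) -> dict:
    # Attach the age to every MCP response so the agent can reason about it
    return {**response, "metadata_age_hours": round(staleness_hours(last_ingested_ms), 1)}
```

With the age in-band, the agent can apply its own threshold, e.g. trust anything under a day and flag anything older.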
DataHub is one of the best MCP backends for metadata because it already holds the curated truth about your data stack. A small set of well-designed tools lets an agent consume that truth without leaking sensitive context or blowing up the prompt window.
See Data Workers in action
15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.
Book a Demo
Related Resources
- MCP Server Amundsen Metadata
- MCP Server Collibra Metadata
- MCP Server Atlan Metadata
- MCP Server Alation Metadata
- MCP Server Unity Catalog Metadata
- Why AI Agents Need MCP Servers for Data Engineering — MCP servers give AI agents structured access to your data tools — Snowflake, BigQuery, dbt, Airflow, and more. Here is why MCP is the int…
- MCP Server Analytics: Understanding How Your AI Tools Are Actually Used — Your team uses dozens of MCP tools every day. MCP analytics tracks adoption, measures ROI, identifies unused tools, and provides the usag…
- How to Build an MCP Server for Your Data Warehouse (Tutorial) — MCP servers give AI agents structured access to your data warehouse. This tutorial walks through building one from scratch — TypeScript,…
- MCP Server Security: Authentication, Authorization, and Audit Trails — MCP servers expose powerful capabilities to AI agents. Securing them requires OAuth 2.1 authentication, scoped authorization, least-privi…
- MCP Server for Snowflake: Connect AI Agents to Your Data Warehouse — Snowflake's MCP server exposes Cortex Analyst, Cortex Search, and schema metadata to AI agents. Here's how to set it up and how Data Work…
- MCP Server for BigQuery: Give AI Agents Access to Your Analytics — BigQuery's MCP server gives AI agents access to schemas, query execution, and cost estimation. Here's how to connect it and use Data Work…
- MCP Server Tutorial: Build a Data Warehouse Integration in 30 Minutes (Python) — Build an MCP server for your data warehouse in 30 minutes with Python. Step-by-step tutorial covering schema exposure, query execution, a…
Explore Topic Clusters
- Data Governance: The Complete Guide — Policies, access controls, PII, and compliance at scale.
- Data Catalog: The Complete Guide — Discovery, metadata, lineage, and the modern catalog stack.
- Data Lineage: The Complete Guide — Column-level lineage, impact analysis, and observability.
- Data Quality: The Complete Guide — Tests, SLAs, anomaly detection, and data reliability engineering.
- AI Data Engineering: The Complete Guide — LLMs, agents, and autonomous workflows across the data stack.
- MCP for Data: The Complete Guide — Model Context Protocol servers, tools, and agent integration.
- Data Mesh & Data Fabric: The Complete Guide — Federated ownership, domain-oriented architecture, and interop.
- Open-Source Data Stack: The Complete Guide — dbt, Airflow, Iceberg, DuckDB, and the modern OSS toolkit.
- AI for Data Infra — The complete category for AI agents built specifically for data engineering, data governance, and data infrastructure work.