Progressive Context Disclosure
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
Progressive context disclosure is a retrieval pattern where the agent sees a compact index first and pulls detail only for the tables it actually needs. It cuts token costs by 70 to 90 percent and boosts accuracy by reducing context bloat.
Naive agents dump the full schema into every prompt. Past a few hundred tables, that breaks. Progressive disclosure fixes the bloat by showing the agent a tight summary first and letting it ask for detail as needed. This guide explains the pattern and how to implement it. Related: context bloat for AI agents and AI for data infrastructure.
The Pattern
The pattern runs in four steps:
- Step one: the agent receives a compact index of table names, short descriptions, and canonicality scores.
- Step two: the agent picks the 5 to 10 tables most likely to answer the question.
- Step three: the agent pulls full schema (columns, types, sample values, tests) for only those tables.
- Step four: the agent generates SQL.
The key insight is that steps one and two are cheap (small context, small LLM call) while step three is expensive (bigger context, bigger call). Doing step one for the whole warehouse and step three for only 5 to 10 tables keeps costs bounded even on huge warehouses.
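To make the flow concrete, here is a minimal Python sketch of the four steps. The `search_index`, `fetch_table_detail`, and `llm` callables are hypothetical stand-ins for whatever retriever and model client your stack provides, and `search_index` is assumed to return dicts with `name` and `description` fields.

```python
from typing import Callable

def answer_question(
    question: str,
    search_index: Callable[[str, int], list[dict]],   # step 1: compact index lookup
    fetch_table_detail: Callable[[str], dict],         # step 3: full schema for one table
    llm: Callable[[str], str],                         # steps 2 and 4: model calls
) -> str:
    # Step 1: cheap -- retrieve compact index entries (name, description, score).
    candidates = search_index(question, 50)

    # Step 2: cheap -- a small prompt asks the model to shortlist 5-10 tables.
    shortlist_prompt = (
        f"Question: {question}\n"
        "Candidate tables (name: description):\n"
        + "\n".join(f"- {c['name']}: {c['description']}" for c in candidates)
        + "\nReturn the 5-10 table names most likely to answer the question, comma-separated."
    )
    shortlist = [t.strip() for t in llm(shortlist_prompt).split(",")]

    # Step 3: expensive, but bounded -- expand detail only for the shortlist.
    details = [fetch_table_detail(name) for name in shortlist]

    # Step 4: generate SQL with full detail for just those tables.
    sql_prompt = f"Question: {question}\nSchemas:\n{details}\nWrite the SQL."
    return llm(sql_prompt)
```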
Why This Beats Dumping Everything
A warehouse with 5,000 tables has roughly 50,000 to 100,000 columns. Dumping all of that into the prompt costs hundreds of thousands of tokens and wastes most of them on tables the agent will never use. Progressive disclosure surfaces only the relevant 5 to 10 tables, which is on the order of 50 to 100 columns, a roughly 1,000x reduction.
Beyond cost, accuracy improves because the model is not distracted by irrelevant tables. Picking between 10 candidates is far easier than picking between 5,000. The agent spends its reasoning budget on the right question instead of filtering noise.
What the Compact Index Looks Like
- Table name — fully qualified
- One-line description — enriched from metadata or LLM
- Canonicality score — derived from usage signals
- Domain tag — finance, product, ops, etc.
- Last updated — for freshness signal
- Row count estimate — for rough size
- Top 3 columns — for quick pattern match
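One way to represent an entry with those fields is a small record per table, roughly like the sketch below; the field names and the rendered line format are illustrative, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    name: str               # fully qualified, e.g. "analytics.finance.revenue_daily"
    description: str        # one line, enriched from metadata or an LLM pass
    canonicality: float     # 0-1, derived from usage signals
    domain: str             # "finance", "product", "ops", ...
    last_updated: str       # ISO date, freshness signal
    row_estimate: int       # rough size
    top_columns: list[str]  # top 3 columns for quick pattern matching

    def to_index_line(self) -> str:
        """Render the roughly 100-token line the agent actually sees."""
        cols = ", ".join(self.top_columns)
        return (f"{self.name} [{self.domain}, canon={self.canonicality:.2f}, "
                f"~{self.row_estimate} rows, updated {self.last_updated}] "
                f"{self.description} (cols: {cols})")
```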
Expansion Criteria
The agent expands detail only for tables that pass a threshold. The threshold is a weighted sum of semantic similarity, canonicality score, domain match, and recency. Typical cutoffs let through 5 to 10 tables per question. Too few and the agent misses good candidates; too many and you are back to bloat.
The cutoff should be learned, not hardcoded. Start with a heuristic and adjust based on accuracy feedback. A good adaptive cutoff produces 5 to 10 tables for simple questions and expands to 15 to 20 for complex multi-join questions automatically.
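A minimal sketch of that scoring and adaptive-cutoff logic might look like the following; the weights, threshold, and bounds are assumptions to start from and tune against your own accuracy feedback.

```python
def score(entry: dict, similarity: float) -> float:
    # Weighted sum of the signals described above; weights are illustrative.
    return (0.5 * similarity               # semantic similarity to the question
            + 0.3 * entry["canonicality"]  # usage-derived canonicality
            + 0.1 * entry["domain_match"]  # 1.0 if domains align, else 0.0
            + 0.1 * entry["recency"])      # 0-1, decays with staleness

def shortlist(scored: list[tuple[str, float]], threshold: float = 0.6,
              min_k: int = 5, max_k: int = 20) -> list[str]:
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    above = [name for name, s in ranked if s >= threshold]
    # Simple questions stay near min_k; complex multi-join questions expand
    # toward max_k because more candidates clear the threshold.
    if len(above) < min_k:
        above = [name for name, _ in ranked[:min_k]]
    return above[:max_k]
```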
When to Re-Expand
Sometimes the agent picks a shortlist, looks at the detail, and realizes it needs more tables. The pattern has to support re-expansion: the agent requests additional candidates, the retriever returns them, the agent proceeds. Re-expansion is rare if the initial shortlist is good, but it must be supported as a fallback.
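One way to implement the fallback is a bounded loop that pulls the next batch of candidates whenever the model reports the current detail is insufficient. The sketch below assumes hypothetical `retrieve_candidates(question, offset, limit)` and `fetch_detail(name)` helpers and an `llm` callable.

```python
def expand_with_fallback(question, retrieve_candidates, fetch_detail, llm,
                         batch: int = 10, max_rounds: int = 3):
    seen, details = set(), []
    for _ in range(max_rounds):
        # Pull the next batch of candidates the agent has not seen yet.
        for name in retrieve_candidates(question, offset=len(seen), limit=batch):
            if name not in seen:
                seen.add(name)
                details.append(fetch_detail(name))
        # Ask the model whether the expanded detail is enough to answer.
        verdict = llm(
            f"Question: {question}\n"
            f"Tables with full detail: {sorted(seen)}\n"
            "Reply ENOUGH if these tables can answer the question, otherwise MORE."
        )
        if "ENOUGH" in verdict.upper():
            break  # the common case: the first shortlist is sufficient
    return details
```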
Caching Expanded Detail
Expanded detail is expensive to compute. Cache it per table with a TTL matching the warehouse update rate. Subsequent requests that need the same table hit the cache and skip the expensive retrieval. Cache invalidation on schema change is cheap because the event stream drives it.
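A minimal per-table TTL cache, assuming a `fetch` callable that performs the expensive retrieval, could look like this sketch; the five-minute default is an assumption to adjust to your warehouse update rate.

```python
import time

class DetailCache:
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, table: str, fetch):
        now = time.time()
        hit = self._entries.get(table)
        if hit and now - hit[0] < self.ttl:
            return hit[1]                    # fresh cache hit, skip retrieval
        detail = fetch(table)                # expensive path
        self._entries[table] = (now, detail)
        return detail

    def invalidate(self, table: str):
        self._entries.pop(table, None)       # called on schema-change events
```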
Common Mistakes
The worst mistake is skipping progressive disclosure and relying on long-context models to absorb the full schema; it seems easier, but accuracy suffers. Another mistake is a compact index that is too sparse to pick good shortlists. A third is having no re-expansion fallback, which means the agent gives up when its first shortlist is wrong. A fourth is not caching expanded detail, which wastes compute.
Data Workers ships progressive disclosure as the default retrieval pattern for data agents. Tokens drop by 70 to 90 percent and accuracy climbs because the agent spends reasoning on the right tables. To see it run on your warehouse, book a demo.
What Goes in the Compact Index
The compact index is the key artifact. It has to be small enough to fit in a few thousand tokens and rich enough to pick good shortlists. The right contents depend on the warehouse but usually include table name, one-line description, canonicality score, domain tag, row count, top three columns, and last update time. That is about 100 tokens per table, so 20 tables fit in 2k tokens and 200 tables fit in 20k tokens.
The index itself must update continuously as the warehouse evolves. New tables get added, old tables get deprecated, descriptions change. A stale index leads to wrong shortlists, which leads to wrong answers. Treat the index like any other context artifact: continuous update, validation, monitored accuracy.
For very large warehouses with tens of thousands of tables, even the compact index is too big to fit in one prompt. The fix is hierarchical: an index of indexes, where the agent first picks a domain (top-level index), then picks tables within the domain (domain-level index), then picks columns within the tables. Each level is small enough to fit.
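A two-level version of the selection step might look like the sketch below, where `top_index` maps domain names to one-line descriptions and `domain_indexes` maps each domain to its table-level index; both structures are illustrative.

```python
def hierarchical_shortlist(question: str, top_index: dict[str, str],
                           domain_indexes: dict[str, list[dict]], llm) -> list[str]:
    # Level 1: the top-level index is just domain -> one-line description.
    domain_prompt = ("Question: " + question + "\nDomains:\n"
                     + "\n".join(f"- {d}: {desc}" for d, desc in top_index.items())
                     + "\nReturn the single most relevant domain name.")
    domain = llm(domain_prompt).strip()

    # Level 2: only that domain's table index goes into the next prompt.
    tables = domain_indexes.get(domain, [])
    table_prompt = ("Question: " + question + "\nTables:\n"
                    + "\n".join(f"- {t['name']}: {t['description']}" for t in tables)
                    + "\nReturn the 5-10 most relevant table names, comma-separated.")
    return [t.strip() for t in llm(table_prompt).split(",")]
```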
Caching Strategies
The compact index can be cached aggressively because it changes slowly. A cache with a five-minute TTL hits 95 percent of requests and the agent rarely has to wait for the catalog. Expanded details can also be cached per table, but with a shorter TTL because schema changes propagate faster.
Cache invalidation is event-driven. When the catalog emits a schema-change event, the relevant cache entry gets invalidated immediately. This combines the speed of caching with the freshness of live retrieval, which is the right tradeoff for production systems.
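As an illustration, an event handler might look like the sketch below. The event shape loosely follows OpenLineage datasets (namespace, name, facets) but is simplified, and `detail_cache.invalidate` and `index_cache.mark_stale` are hypothetical methods on your own cache objects.

```python
def on_lineage_event(event: dict, detail_cache, index_cache) -> None:
    # Only schema changes matter for cached table detail; other facets are ignored.
    for output in event.get("outputs", []):
        table = f"{output['namespace']}.{output['name']}"
        if "schema" in output.get("facets", {}):
            detail_cache.invalidate(table)  # expanded detail must be refetched
            index_cache.mark_stale(table)   # compact index line gets re-enriched
```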
Data Workers handles cache invalidation automatically using OpenLineage events. Teams do not have to wire it themselves, and the cache hit rate stays high without sacrificing freshness. The result is fast retrieval that still reflects the current state of the warehouse.
Progressive disclosure is the scaling answer for retrieval on large warehouses. Start with a compact index, expand selectively, cache aggressively, and support re-expansion. The result is fast, cheap, and accurate retrieval that works at warehouse scale.
Related Resources
- Context-Compounding Agents: How Claude Gets Smarter About Your Data Over Time — Context-compounding agents accumulate knowledge across sessions via CLAUDE.md persistent memory.
- Context Engineering for Data: How to Give AI Agents the Knowledge They Need — Context engineering gives AI agents schemas, lineage, quality scores, business rules, and tribal knowledge.
- Building a Context Graph with MCP: Architecture Patterns for Data Teams — Build a context graph by connecting your data catalog, lineage tools, quality monitors, and semantic layer via MCP — creating one queryab…
- Context Layer Architecture: 5 Patterns for Giving AI Agents Data Understanding — Five architecture patterns for building a context layer: centralized, federated, hybrid, MCP-native, and graph-based. Here's when to use…
- Context Layer for Snowflake: Give AI Agents Full Understanding of Your Warehouse — Build a context layer on Snowflake by connecting Cortex AI, schema metadata, lineage graphs, and quality scores — giving AI agents full u…
- Context Layer for Databricks: Unity Catalog + AI Agents — Databricks Unity Catalog provides metadata governance. A context layer adds lineage, quality scores, and semantic definitions — enabling…
- Context Layer for BigQuery: Connect AI Agents to Google Cloud Analytics — Build a context layer for BigQuery that gives AI agents metadata access, lineage understanding, quality signals, and cost-aware query pla…
- How to Evaluate Context Layer Vendors: Buyer's Checklist for Data Leaders — Evaluating context layer vendors? This checklist covers 15 criteria: MCP support, agent compatibility, lineage depth, semantic integratio…
- The Context Layer ROI: Quantifying the Business Impact of AI-Ready Data — A context layer delivers measurable ROI: 66% query accuracy improvement, $1.3M+ annual savings from reduced toil, 30-40% warehouse cost r…
- When LLMs Hallucinate About Your Data: How Context Layers Prevent AI Misinformation — LLMs hallucinate 66% more often when querying raw tables vs through a semantic/context layer. Here is how context layers prevent AI misin…
- Context Bloat for AI Agents
- Corrections Log for the Context Layer