guide5 min read

Lineage Agent Column Level Capture

Lineage Agent Column Level Capture

Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.

Last updated .

Data Workers' Lineage Agent captures column-level lineage across SQL transformations, dbt models, Python scripts, and cross-platform data flows — providing the precision required for impact analysis, debugging, and regulatory compliance that table-level lineage cannot deliver. Column-level lineage answers the question 'exactly which source columns contribute to this report metric,' not just 'which tables were involved.'

This guide covers the Lineage Agent's column-level capture methodology, SQL parsing capabilities, cross-platform lineage tracking, and use cases where column-level precision makes the difference between actionable lineage and decorative metadata.

Why Column-Level Lineage Matters

Table-level lineage tells you that a report depends on a staging table. Column-level lineage tells you that the revenue metric in cell B7 of the quarterly report is calculated from the sum of the amount column in the orders staging table, which is sourced from the total_amount field in the Salesforce Opportunity object, converted from the opportunity's currency to USD using the exchange_rate column from the finance.fx_rates table.

This precision is critical for three use cases: impact analysis (which reports break if I rename this column?), debugging (where did this wrong number come from?), and compliance (can I prove that this metric was calculated correctly from authoritative sources?). Without column-level lineage, each of these exercises requires manual SQL reading that scales poorly.

Use CaseTable-Level LineageColumn-Level Lineage
Impact analysisWhich tables are downstream?Which specific columns and calculations are affected?
Root cause debuggingWhich tables contributed to the report?Which exact column transformation produced the wrong value?
Regulatory complianceData flows between these systemsThis metric is computed from these specific source fields using this formula
Schema change reviewThis table has downstream consumersThis column is used in 47 SELECT clauses and 12 WHERE filters
Data qualitySomething is wrong in this tableThis column's null rate spiked because the source column changed type
DocumentationThis table feeds these reportsThis column means X, calculated as Y from Z source fields

SQL Parsing for Column Lineage

The Lineage Agent parses SQL at the AST (Abstract Syntax Tree) level to extract column-level dependencies. It handles standard SQL patterns (SELECT, JOIN, WHERE, GROUP BY, HAVING, window functions, CTEs, subqueries) and warehouse-specific extensions (Snowflake QUALIFY, BigQuery STRUCT, Redshift DISTKEY). The parser resolves column aliases, expands SELECT *, and traces columns through multi-level CTEs and subquery nesting.

SQL parsing is the foundation of accurate column lineage. Pattern-matching approaches (regex on SQL strings) fail on edge cases like column aliases, nested subqueries, and CASE expressions. AST-level parsing handles these correctly by building a complete model of the query's column flow, from source table columns through every transformation to the final output columns.

  • CTE resolution — traces columns through multi-level Common Table Expressions with proper scoping
  • Subquery handling — resolves correlated and uncorrelated subqueries to identify column dependencies
  • Window functions — captures PARTITION BY and ORDER BY column dependencies for analytics functions
  • CASE expressions — traces all branches to identify every column that could contribute to the output
  • UDF resolution — maps user-defined function inputs to outputs when function definitions are available
  • **SELECT * expansion** — resolves wildcard selections to specific column lists based on source table schemas

Cross-Platform Lineage

Real data pipelines span multiple platforms. A column's lineage might start in a Postgres OLTP database, flow through a Kafka topic, land in a Snowflake staging table, get transformed through dbt, and end up in a Looker dashboard. The Lineage Agent tracks columns across these platform boundaries by maintaining a unified lineage graph that connects platform-specific lineage segments through shared identifiers (table names, column names, topic schemas).

Cross-platform lineage is especially valuable for regulated industries where auditors need end-to-end provenance. The agent produces a complete chain from the system of record through every transformation to the final consumer, with each link annotated with the transformation logic, the platform, and the execution timestamp. This chain provides the evidence that regulatory compliance requires.

dbt Model Lineage

For dbt projects, the Lineage Agent enhances dbt's built-in lineage with column-level precision. While dbt's manifest tracks model-to-model dependencies, the Lineage Agent traces individual columns through refs and sources, capturing exactly which source columns flow into each output column. This column-level dbt lineage enables precise impact analysis: when a source column changes, the agent identifies every downstream model column that is affected, not just every downstream model.

The agent also captures lineage through dbt macros and Jinja templates. When a macro generates SQL dynamically, the agent evaluates the macro with representative parameters to determine the column-level lineage. This handles the common pattern of shared transformation logic that is difficult to trace manually.

Column-level lineage graphs can be overwhelming when displayed in full. The Lineage Agent provides focused visualization: starting from a specific column of interest, it shows only the upstream sources and downstream consumers of that column, with transformation logic annotated on each edge. Users can expand the graph incrementally to explore related columns without being overwhelmed by the full platform lineage.

Lineage search enables natural language queries: 'where does the revenue number on the board deck come from?' resolves to the specific column lineage chain from source to presentation. This search capability makes lineage accessible to non-technical users who need provenance information but cannot read SQL.

Integration and Use Cases

Column-level lineage powers multiple downstream capabilities: the Schema Agent uses it for breaking change review, the Incidents Agent uses it for root cause analysis, and the Governance Agent uses it for regulatory evidence. The lineage graph is the connective tissue that makes all these capabilities precise rather than approximate.

For teams building comprehensive lineage capabilities, column-level capture works alongside impact analysis for change management and regulatory evidence for compliance. Book a demo to see column-level lineage on your data warehouse.

Column-level lineage is the precision layer that transforms lineage from a diagram on a wall into an operational tool. The Lineage Agent parses SQL at the AST level, tracks columns across platforms, and provides the exact source-to-destination mappings that impact analysis, debugging, and compliance require.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo

Related Resources

Explore Topic Clusters