Building an Incident Debugging Agent: What We've Learned So Far
Our first working prototype, what it can do, and where it falls short
By The Data Workers Team
Data pipeline downtime costs enterprises $150K-$540K per hour. The average incident takes 2-4 hours to diagnose manually — most of that time spent tracing lineage across five different tools, not actually fixing the problem. We built the Incident Debugging Agent to eliminate that diagnostic bottleneck.
Incident debugging is where we started building. Not because it is the easiest problem, but because it is the most painful. Every data engineer we talked to described the same experience: an alert fires, you open your laptop, and you spend the next two to four hours manually tracing the problem across five different tools.
What the Agent Does
When a data incident occurs, the agent:
- Ingests the alert context. What failed, when, what system reported it.
- Runs diagnostic queries. Checks freshness, row counts, null rates, schema changes, and value distributions for the affected table and its upstream dependencies.
- Traces lineage. Uses the Data Context and Catalog Agent's lineage information to identify upstream sources and downstream consumers. Maps the blast radius.
- Correlates with recent changes. Checks dbt deployment logs, schema migration history, and orchestrator runs for recent changes that could explain the breakage.
- Generates a diagnosis. Produces a structured incident report with probable root cause, affected assets, blast radius, and suggested remediation steps.
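The five steps above can be sketched as a simple pipeline. Everything here is illustrative: the function names, report fields, and canned return values are stand-ins for the real system, which queries live warehouses, catalogs, and orchestrators.

```python
from dataclasses import dataclass

@dataclass
class IncidentReport:
    probable_root_cause: str
    affected_assets: list
    blast_radius: list
    remediation_steps: list

# --- Stubbed data sources; a real agent would query live systems. ---

def run_diagnostics(table):
    # Freshness, row counts, null rates, schema changes for `table`.
    return {"freshness_hours": 6, "row_count_delta": 0}

def trace_lineage(table):
    # Upstream sources and downstream consumers, from catalog lineage.
    return ["raw.events"], ["dash.revenue", "dash.funnel", "dash.churn"]

def recent_changes(assets):
    # Recent deploys, migrations, and orchestrator runs touching `assets`.
    return ["Airflow DAG extract_events failed: source API timeout"]

def diagnose(alert):
    table = alert["table"]                        # 1. ingest alert context
    checks = run_diagnostics(table)               # 2. run diagnostic queries
    upstream, downstream = trace_lineage(table)   # 3. trace lineage / blast radius
    changes = recent_changes(upstream + [table])  # 4. correlate with changes
    return IncidentReport(                        # 5. generate a diagnosis
        probable_root_cause=changes[0] if changes else "undetermined",
        affected_assets=[table] + downstream,
        blast_radius=downstream,
        remediation_steps=["re-trigger the DAG with exponential backoff"],
    )

report = diagnose({"table": "mart.orders", "check": "freshness"})
print(report.probable_root_cause)
```

The key design point is that each step only consumes the outputs of earlier steps, so any single step can be re-run or audited in isolation.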
The engineer gets a diagnosis, not a symptom. Instead of "Table X freshness check failed," they get "Table X has not been updated in 6 hours because the upstream Airflow DAG failed at the extraction step due to a source API timeout. 3 downstream dashboards are affected. Suggested action: re-trigger the DAG with exponential backoff."
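That "diagnosis, not a symptom" summary falls out of the structured report almost mechanically. A minimal sketch, with hypothetical field names, of how the structured fields become the sentence an engineer reads:

```python
def summarize(report):
    # Turn structured diagnosis fields into a one-paragraph summary.
    # All field names here are illustrative, not the real schema.
    return (
        f"{report['asset']} {report['symptom']} because {report['root_cause']}. "
        f"{len(report['downstream'])} downstream dashboards are affected. "
        f"Suggested action: {report['remediation']}."
    )

msg = summarize({
    "asset": "Table X",
    "symptom": "has not been updated in 6 hours",
    "root_cause": "the upstream Airflow DAG failed at the extraction step "
                  "due to a source API timeout",
    "downstream": ["dash_a", "dash_b", "dash_c"],
    "remediation": "re-trigger the DAG with exponential backoff",
})
print(msg)
```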
Early Results (With Caveats)
- Mean time to diagnosis: minutes, versus the hours typical of manual debugging.
- Root cause accuracy: the majority of diagnoses correctly identify the root cause; the remainder identify a contributing factor but miss the primary cause.
Important caveats: these are early results from a limited set of historical incidents across a small number of environments. We share them to show directional progress, not to claim production readiness.
Where This Fails
- Novel failure modes. The agent struggles with failures it has not seen patterns for — a subtle data drift, a business logic change, an infrastructure issue that manifests as a data issue.
- Cross-system correlation. When the root cause spans multiple systems, the agent's ability to correlate drops significantly.
- Semantic understanding. The agent can tell you that values in a column changed. It cannot tell you whether those values are wrong.
- False confidence. The agent sometimes generates plausible-sounding diagnoses that are wrong.
What We Learned About Trust
Data engineers do not trust a diagnosis they cannot verify. Every design partner conversation included some version of: "This is cool, but how do I know it is right?" The evidence chain — showing every query and result — is not a nice-to-have feature. It is the feature.
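One way to make that concrete is to attach every query the agent ran, its verbatim result, and the inference drawn from it to the diagnosis itself. This sketch uses hypothetical names and example values; the point is the shape, not the schema:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    query: str       # the exact query the agent ran
    result: str      # what came back, verbatim
    conclusion: str  # what the agent inferred from that result

@dataclass
class Diagnosis:
    summary: str
    evidence: list   # the full chain, so an engineer can re-run every step

    def verifiable(self):
        # A diagnosis with no evidence chain should never be shown.
        return len(self.evidence) > 0

d = Diagnosis(
    summary="Upstream DAG failure caused freshness breach",
    evidence=[
        Evidence(
            query="SELECT max(updated_at) FROM mart.orders",
            result="6 hours ago",
            conclusion="freshness check breached",
        ),
    ],
)
assert d.verifiable()
```

Because the evidence entries carry the literal queries, "how do I know it is right?" has a direct answer: copy the query, run it yourself, compare.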
We also learned that the agent needs to say "I don't know." Our early versions always produced a diagnosis, even when the evidence was ambiguous. Engineers found this less trustworthy than an agent that says "I found these anomalies but cannot determine a definitive root cause."
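A minimal way to encode that behavior is a confidence gate: only report a root cause when the best-scored hypothesis clears a threshold, otherwise surface the raw anomalies. The threshold value and scoring inputs here are illustrative assumptions:

```python
def report_or_abstain(candidates, anomalies, threshold=0.8):
    # candidates: (hypothesis, confidence) pairs scored by the agent.
    # Below the threshold, present anomalies instead of guessing a cause.
    if candidates:
        best, conf = max(candidates, key=lambda c: c[1])
        if conf >= threshold:
            return f"Probable root cause: {best} (confidence {conf:.0%})"
    findings = "; ".join(anomalies) or "no anomalies detected"
    return ("Cannot determine a definitive root cause. "
            f"Anomalies found: {findings}")

# Ambiguous evidence: the agent abstains rather than bluffing.
print(report_or_abstain([("schema change upstream", 0.55)],
                        ["null rate spike in orders.amount"]))
```

The gate trades coverage for trust, which matched what our design partners asked for: a smaller number of diagnoses they can actually act on.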