Building an Incident Debugging Agent: What We've Learned So Far
Our first working prototype, what it can do, and where it falls short
By The Data Workers Team
Data pipeline downtime costs enterprises $150K-$540K per hour. The average incident takes 2-4 hours to diagnose manually — most of that time spent tracing lineage across five different tools, not actually fixing the problem. We built the Incident Debugging Agent to eliminate that diagnostic bottleneck.
Incident debugging is where we started building. Not because it is the easiest problem, but because it is the most painful. Every data engineer we talked to described the same experience: an alert fires, you open your laptop, and you spend the next two to four hours manually tracing the problem across five different tools.
What the Agent Does
When a data incident occurs, the agent:
- Ingests the alert context. What failed, when, what system reported it.
- Runs diagnostic queries. Checks freshness, row counts, null rates, schema changes, and value distributions for the affected table and its upstream dependencies.
- Traces lineage. Uses the Data Context and Catalog Agent's lineage information to identify upstream sources and downstream consumers. Maps the blast radius.
- Correlates with recent changes. Checks dbt deployment logs, schema migration history, and orchestrator runs for recent changes that could explain the breakage.
- Generates a diagnosis. Produces a structured incident report with probable root cause, affected assets, blast radius, and suggested remediation steps.
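To make the diagnostic-query step above concrete, here is a minimal sketch of four of the checks it names (freshness, row counts, null rates, schema changes) written as plain Python functions. The function names, arguments, and in-memory row format are assumptions made for illustration; the agent itself runs queries against the warehouse, and this is not its implementation.

```python
from datetime import datetime, timedelta


def is_stale(last_updated, now, max_age):
    """Freshness check: True when the table is older than its allowed age."""
    return now - last_updated > max_age


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None (0.0 for no rows)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def row_count_drop(previous, current):
    """Relative drop in row count versus the previous load (0.0 without a baseline)."""
    if previous <= 0:
        return 0.0
    return max(0.0, (previous - current) / previous)


def schema_diff(expected, actual):
    """Columns added or removed relative to the expected schema."""
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
    }


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise each check.
    now = datetime(2024, 1, 1, 6, 0)
    print(is_stale(datetime(2024, 1, 1, 0, 0), now, timedelta(hours=1)))
    print(null_rate([{"amount": 10}, {"amount": None}], "amount"))
    print(row_count_drop(previous=100, current=75))
    print(schema_diff(["id", "amount"], ["id", "total"]))
```

Each check returns a plain value rather than a pass/fail verdict, so a later step can decide which thresholds matter for a given table.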
The engineer gets a diagnosis, not a symptom. Instead of "Table X freshness check failed," they get "Table X has not been updated in 6 hours because the upstream Airflow DAG failed at the extraction step due to a source API timeout. 3 downstream dashboards are affected. Suggested action: re-trigger the DAG with exponential backoff."
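The blast-radius and change-correlation parts of a diagnosis like this can be sketched as a graph walk plus a time-window filter. The lineage map, asset names, `Change` record, and 24-hour window below are all hypothetical; in practice the lineage comes from the Data Context and Catalog Agent and the change events from deployment and orchestrator logs.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical lineage: each key feeds the assets listed in its value.
LINEAGE = {
    "source_api": ["raw_orders"],
    "raw_orders": ["stg_orders"],
    "stg_orders": ["orders_dashboard", "revenue_dashboard", "fct_revenue"],
    "fct_revenue": ["finance_dashboard"],
}


def downstream_assets(lineage, start):
    """Breadth-first walk: every asset reachable downstream of `start`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream_assets(lineage, start):
    """Same walk over inverted edges: every asset `start` depends on."""
    inverted = {}
    for parent, children in lineage.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return downstream_assets(inverted, start)


@dataclass(frozen=True)
class Change:
    asset: str
    kind: str  # e.g. "dbt_deploy", "schema_migration", "orchestrator_run"
    at: datetime


def recent_changes(changes, assets, incident_at, window=timedelta(hours=24)):
    """Changes touching `assets` within `window` before the incident, newest first."""
    hits = [
        c for c in changes
        if c.asset in assets and incident_at - window <= c.at <= incident_at
    ]
    return sorted(hits, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    print(downstream_assets(LINEAGE, "stg_orders"))
    print(upstream_assets(LINEAGE, "stg_orders"))
```

Walking the same graph in both directions keeps the two questions separate: upstream assets are candidates for the root cause, downstream assets are the blast radius.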
Early Results (With Caveats)
- Mean time to diagnosis: minutes, compared with hours for manual debugging.
- Root cause accuracy: The majority of diagnoses correctly identify the root cause. The remainder identify a contributing factor but miss the primary cause.
Important caveats: these are early results from a limited set of historical incidents across a small number of environments. We share them to show directional progress, not to claim production readiness.
Where This Fails
- Novel failure modes. The agent struggles with failures it has not seen patterns for: a subtle data drift, a business logic change, an infrastructure issue that manifests as a data issue.
- Cross-system correlation. When the root cause spans multiple systems, the agent's ability to correlate drops significantly.
- Semantic understanding. The agent can tell you that values in a column changed. It cannot tell you whether those values are wrong.
- False confidence. The agent sometimes generates plausible-sounding diagnoses that are wrong.
What We Learned About Trust
Data engineers do not trust a diagnosis they cannot verify. Every design partner conversation included some version of: "This is cool, but how do I know it is right?" The evidence chain — showing every query and result — is not a nice-to-have feature. It is the feature.
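One way to picture an evidence chain is as a wrapper that records every query alongside its result before anything else sees the answer. The sketch below is an assumption about shape only: `EvidenceChain`, `EvidenceStep`, the sample queries, and the canned results are hypothetical stand-ins, not the agent's actual interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class EvidenceStep:
    description: str
    query: str
    result: Any


@dataclass
class EvidenceChain:
    steps: list = field(default_factory=list)

    def run(self, description: str, query: str, execute: Callable[[str], Any]) -> Any:
        """Execute the query, record it with its result, and return the result."""
        result = execute(query)
        self.steps.append(EvidenceStep(description, query, result))
        return result

    def render(self) -> str:
        """Numbered, human-readable listing an engineer can re-run and verify."""
        lines = []
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{number}. {step.description}")
            lines.append(f"   query:  {step.query}")
            lines.append(f"   result: {step.result!r}")
        return "\n".join(lines)


# Hypothetical stand-in for a warehouse client: canned results keyed by query text.
CANNED_RESULTS = {
    "SELECT MAX(updated_at) FROM orders": "2024-01-01 02:00:00",
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL": 0,
}


def fake_execute(query: str) -> Any:
    return CANNED_RESULTS[query]


if __name__ == "__main__":
    chain = EvidenceChain()
    chain.run("Check freshness", "SELECT MAX(updated_at) FROM orders", fake_execute)
    chain.run("Check null amounts", "SELECT COUNT(*) FROM orders WHERE amount IS NULL", fake_execute)
    print(chain.render())
```

Because the only way to get a result is through `run`, a diagnosis built on this chain cannot cite a query that was never recorded.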
We also learned that the agent needs to say "I don't know." Our early versions always produced a diagnosis, even when the evidence was ambiguous. Engineers found this less trustworthy than an agent that says "I found these anomalies but cannot determine a definitive root cause."
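A simple way to express that behavior is an abstention rule: report a root cause only when the leading hypothesis is both strong and clearly ahead of the runner-up. The sketch below assumes each hypothesis carries an evidence score between 0 and 1; the `Hypothesis` type, the scoring, and both thresholds are illustrative assumptions, not values the agent uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    score: float  # hypothetical evidence score in [0, 1]


def decide(hypotheses, min_score=0.7, min_margin=0.2):
    """Return a diagnosis only when the evidence is strong and unambiguous."""
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    if not ranked:
        return "No anomalies found; cannot determine a root cause."
    top = ranked[0]
    runner_up = ranked[1].score if len(ranked) > 1 else 0.0
    if top.score >= min_score and top.score - runner_up >= min_margin:
        return f"Probable root cause: {top.cause}"
    # Ambiguous evidence: list what was found instead of forcing an answer.
    found = "; ".join(h.cause for h in ranked)
    return f"I found these anomalies but cannot determine a definitive root cause: {found}"


if __name__ == "__main__":
    print(decide([Hypothesis("upstream DAG failed", 0.9), Hypothesis("schema change", 0.3)]))
    print(decide([Hypothesis("upstream DAG failed", 0.6), Hypothesis("schema change", 0.55)]))
```

The margin test matters as much as the score test: two hypotheses that both look plausible are exactly the case where a forced answer reads as false confidence.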