Building a Quality Monitoring Agent: Lessons From Alert Fatigue
How we are trying to fix the signal-to-noise problem in data quality
By The Data Workers Team
Alert fatigue is the silent killer of data quality programs. You deploy a quality tool, configure monitors, and within weeks you are drowning in alerts — 40-60% of which are false positives or low-priority noise. Engineers start ignoring alerts. Real issues get missed.
The cost is measurable: teams we talked to report that 40-60% of their on-call time goes to triaging alerts that turn out to be noise. That is senior engineering talent — $200K+ fully loaded — spent on work that follows the same patterns week after week. The quality tools are doing their job (detecting). The problem is everything that happens after detection: prioritization, correlation, root cause analysis, and remediation.
What the Agent Does
The Quality Monitoring Agent runs continuously and performs four functions:
- Anomaly detection with context. Standard statistical monitoring, but correlated with known events. If an upstream system had scheduled maintenance, the agent suppresses the downstream freshness alerts.
- Impact-weighted prioritization. Not all tables are equal. A quality issue on a table that feeds your CFO's revenue dashboard is more important than an issue on an internal experimentation table.
- Alert correlation. When 12 alerts fire in a 10-minute window, they are usually 1-2 root causes with multiple symptoms. The agent groups correlated alerts and presents them as incidents.
- Quality scoring. Each monitored asset gets a composite quality score across freshness, completeness, consistency, accuracy, and conformity dimensions.
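The correlation step is the easiest of the four to sketch. The following is a simplified illustration, not our production logic: the `Alert` shape, the `correlate` function, and the 10-minute window default are all assumptions made for the example. It clusters alerts into incidents whenever consecutive alerts fire within the window of each other.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    table: str
    fired_at: float  # unix seconds (illustrative field set)
    message: str

def correlate(alerts, window_s=600):
    """Group alerts that fire within `window_s` seconds of the previous
    alert into a single incident. Returns a list of incidents, each a
    list of Alerts."""
    incidents, current = [], []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        # A gap larger than the window closes the current incident.
        if current and alert.fired_at - current[-1].fired_at > window_s:
            incidents.append(current)
            current = []
        current.append(alert)
    if current:
        incidents.append(current)
    return incidents
```

In practice correlation also needs lineage (two alerts on unrelated tables should not merge just because they fired together), but time-window clustering alone already collapses most alert storms into a handful of incidents.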
What We Learned About Auto-Remediation
- Retry is the easy case. If a pipeline failed due to a transient error, retrying is usually safe. Maybe 20-25% of incidents fall into this category.
- Backfill is dangerous. Automatically backfilling data sounds simple until the backfill overwrites corrected data or triggers expensive full-table recomputation.
- Default values are almost never right. Replacing nulls with defaults is a business logic decision, not a data engineering decision.
Our current approach: the agent can auto-remediate retries (with configurable limits). Everything else gets a recommended action with a one-click approval. The human decides. The agent executes.
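The retry path can be sketched in a few lines. This is a minimal illustration under assumed names (`auto_retry`, `is_transient`), not our actual remediation code: the key points it shows are the configurable attempt limit and the transient-error check, with anything else escalated to a human.

```python
import time

def auto_retry(run_pipeline, max_retries=3, backoff_s=60,
               is_transient=lambda exc: True):
    """Retry a failed pipeline run up to `max_retries` times, but only
    for errors classified as transient. Non-transient errors, and the
    final failed attempt, re-raise so a human can approve next steps."""
    for attempt in range(1, max_retries + 1):
        try:
            return run_pipeline()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_retries:
                raise  # escalate with a recommended action instead of looping
            time.sleep(backoff_s * attempt)  # simple linear backoff
```

The `is_transient` classifier is doing the real work here; in our experience it is worth being conservative, since a retry loop on a non-transient failure just delays the page.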
Honest Failures
- Seasonality blindness. Our anomaly detection initially flagged every weekend dip as an anomaly. Adding day-of-week patterns helped, but we still get false positives around holidays and campaigns.
- The cold start problem. The agent needs historical data to establish baselines. For new tables, the initial monitoring period generates unreliable alerts.
- Custom business rules. Every company has data quality rules specific to their business. The agent can learn these from examples, but defining them is still manual work.
- Quality score calibration. Our quality scores are useful for relative comparison, but the absolute numbers are not yet meaningful.
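The day-of-week fix mentioned above amounts to keeping a separate baseline per weekday instead of one global mean. A minimal sketch, with assumed function names and a z-score threshold chosen for illustration (this does not handle holidays or campaigns, which is exactly where it still fails):

```python
from collections import defaultdict
from statistics import mean, stdev

def day_of_week_baselines(history):
    """history: iterable of (weekday, row_count) pairs, weekday 0-6.
    Returns per-weekday (mean, stdev) so a Saturday dip is compared
    against other Saturdays, not against the overall average."""
    by_day = defaultdict(list)
    for weekday, count in history:
        by_day[weekday].append(count)
    return {d: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for d, v in by_day.items()}

def is_anomalous(weekday, count, baselines, z=3.0):
    mu, sigma = baselines[weekday]
    if sigma == 0:
        return count != mu  # degenerate baseline: any deviation flags
    return abs(count - mu) / sigma > z
```

Holidays break this because they look like the wrong weekday; handling them requires an explicit calendar of known events, which is the "anomaly detection with context" piece described earlier.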
Where we are today: working prototypes, active design partner conversations, and a lot of honest lessons about what agents can and cannot do in production data environments.