Building a Quality Monitoring Agent: Lessons From Alert Fatigue
How we are trying to fix the signal-to-noise problem in data quality
By The Data Workers Team
Alert fatigue is the silent killer of data quality programs. You deploy a quality tool, configure monitors, and within weeks you are drowning in alerts, 40-60% of which are false positives or low-priority noise. Engineers start ignoring alerts, and real issues get missed.
The cost is measurable: teams we talked to report that 40-60% of their on-call time goes to triaging alerts that turn out to be noise. That is senior engineering talent ($200K+ fully loaded) spent on work that follows the same patterns week after week. The quality tools are doing their job, which is detection. The problem is everything that happens after detection: prioritization, correlation, root cause analysis, and remediation.
What the Agent Does
The Quality Monitoring Agent runs continuously and performs four functions:
- Anomaly detection with context. The agent runs standard statistical monitoring, but correlates the results with known events. If an upstream system had scheduled maintenance, the agent suppresses the downstream freshness alerts.
- Impact-weighted prioritization. Not all tables are equal. A quality issue on a table that feeds your CFO's revenue dashboard is more important than an issue on an internal experimentation table.
- Alert correlation. When 12 alerts fire in a 10-minute window, they are usually 1-2 root causes with multiple symptoms. The agent groups correlated alerts and presents them as incidents.
- Quality scoring. Each monitored asset gets a composite quality score across freshness, completeness, consistency, accuracy, and conformity dimensions.
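To make the first three functions concrete, here is a minimal sketch of a suppress, correlate, prioritize pass. It is an illustration under stated assumptions, not the agent's actual implementation: the `Alert` and `Incident` types, the `upstream` field, the maintenance and impact lookups, and the choice to correlate by shared upstream source are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    table: str
    fired_at: datetime
    upstream: str  # hypothetical: the upstream source this table depends on


@dataclass
class Incident:
    alerts: list = field(default_factory=list)
    priority: float = 0.0


def suppress_known_events(alerts, maintenance):
    """Drop alerts whose upstream was inside a scheduled maintenance window.

    `maintenance` maps an upstream name to a (start, end) datetime pair.
    """
    kept = []
    for alert in alerts:
        window = maintenance.get(alert.upstream)
        if window and window[0] <= alert.fired_at <= window[1]:
            continue  # known event, so the alert carries no new information
        kept.append(alert)
    return kept


def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts that share an upstream and fire within `window`
    of the first alert in the group (assumed correlation rule)."""
    incidents = []
    open_by_upstream = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        incident = open_by_upstream.get(alert.upstream)
        if incident is None or alert.fired_at - incident.alerts[0].fired_at > window:
            incident = Incident()
            incidents.append(incident)
            open_by_upstream[alert.upstream] = incident
        incident.alerts.append(alert)
    return incidents


def prioritize(incidents, impact):
    """Rank incidents by the highest impact weight among their tables.

    Tables missing from `impact` get a default weight of 1.0.
    """
    for incident in incidents:
        incident.priority = max(impact.get(a.table, 1.0) for a in incident.alerts)
    return sorted(incidents, key=lambda i: i.priority, reverse=True)
```

A caller would chain the three steps, for example `prioritize(correlate(suppress_known_events(alerts, maintenance)), impact)`. Grouping by upstream is the simplest possible correlation key; a lineage graph would likely give better groupings.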
What We Learned About Auto-Remediation
- Retry is the easy case. If a pipeline failed due to a transient error, retrying is usually safe. Maybe 20-25% of incidents fall into this category.
- Backfill is dangerous. Automatically backfilling data sounds simple until the backfill overwrites corrected data or triggers expensive full-table recomputation.
- Default values are almost never right. Replacing nulls with defaults is a business logic decision, not a data engineering decision.
Our current approach: the agent can auto-remediate retries (with configurable limits). Everything else gets a recommended action with a one-click approval. The human decides. The agent executes.
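The policy above can be sketched as a small decision function. This is an illustrative sketch only: the `Action` and `Decision` names, the `decide` signature, and the default limit of 3 retries are assumptions, not the agent's real interface.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    BACKFILL = "backfill"
    FILL_DEFAULTS = "fill_defaults"


@dataclass
class Decision:
    action: Action
    execute_now: bool     # True only for automatic retries
    needs_approval: bool  # True when a human must approve first
    reason: str


def decide(action, retries_so_far, max_auto_retries=3):
    """Auto-execute retries up to a configurable limit; route everything
    else (and exhausted retries) to a human for approval."""
    if action is Action.RETRY and retries_so_far < max_auto_retries:
        return Decision(action, True, False, "transient failure, within retry limit")
    if action is Action.RETRY:
        return Decision(action, False, True, "retry limit reached")
    return Decision(action, False, True, "not safe to automate")
```

Keeping the rule this narrow is deliberate: backfills and default fills never take the automatic branch, whatever the limits are set to.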
Honest Failures
- Seasonality blindness. Our anomaly detection initially flagged every weekend dip as an anomaly. Adding day-of-week patterns helped, but we still get false positives around holidays and campaigns.
- The cold start problem. The agent needs historical data to establish baselines. For new tables, the initial monitoring period generates unreliable alerts.
- Custom business rules. Every company has data quality rules specific to its business. The agent can learn these from examples, but defining them is still manual work.
- Quality score calibration. Our quality scores are useful for relative comparison, but the absolute numbers are not yet meaningful.
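The first two failures can be illustrated with a per-weekday baseline that refuses to answer until it has enough history. This is a minimal sketch, not our detector: the z-score test, the `min_samples` and `z_threshold` defaults, and the synthetic row counts are all assumptions chosen for the example.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def weekday_anomaly(history, day, value, min_samples=4, z_threshold=3.0):
    """Compare `value` only against past values from the same weekday.

    `history` is a list of (date, value) pairs. Returns True or False,
    or None when there is too little same-weekday history (cold start).
    """
    same_weekday = [v for d, v in history if d.weekday() == day.weekday()]
    if len(same_weekday) < min_samples:
        return None  # no reliable baseline yet, so do not alert
    mu = mean(same_weekday)
    sigma = stdev(same_weekday)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Six weeks of synthetic daily row counts with a weekend dip.
start = date(2024, 1, 1)  # a Monday
history = []
for i in range(42):
    day = start + timedelta(days=i)
    base = 300 if day.weekday() >= 5 else 1000
    history.append((day, base + i % 3))

next_saturday = start + timedelta(days=47)
```

With this history, a Saturday count near 300 is treated as normal, while the same count on a weekday would stand out. Holidays and campaigns still break the pattern, since a weekday baseline knows nothing about them.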
Where we are today: working prototypes, active design partner conversations, and a lot of honest lessons about what agents can and cannot do in production data environments.