
Building a Quality Monitoring Agent: Lessons From Alert Fatigue

How we are trying to fix the signal-to-noise problem in data quality

By The Data Workers Team

Alert fatigue is the silent killer of data quality programs. You deploy a quality tool, configure monitors, and within weeks you are drowning in alerts — 40-60% of which are false positives or low-priority noise. Engineers start ignoring alerts. Real issues get missed.

The cost is measurable: teams we talked to report that 40-60% of their on-call time goes to triaging alerts that turn out to be noise. That is senior engineering talent — $200K+ fully loaded — spent on work that follows the same patterns week after week. The quality tools are doing their job (detecting). The problem is everything that happens after detection: prioritization, correlation, root cause analysis, and remediation.

What the Agent Does

The Quality Monitoring Agent runs continuously and performs four functions:

  • Anomaly detection with context. Standard statistical monitoring but correlated with known events. If an upstream system had scheduled maintenance, the agent suppresses the downstream freshness alerts.
  • Impact-weighted prioritization. Not all tables are equal. A quality issue on a table that feeds your CFO's revenue dashboard is more important than an issue on an internal experimentation table.
  • Alert correlation. When 12 alerts fire in a 10-minute window, they are usually 1-2 root causes with multiple symptoms. The agent groups correlated alerts and presents them as incidents.
  • Quality scoring. Each monitored asset gets a composite quality score across freshness, completeness, consistency, accuracy, and conformity dimensions.
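To make the alert-correlation idea concrete, here is a minimal sketch of grouping correlated alerts into incidents. The `Alert`/`Incident` shapes and the lineage-derived `root_asset` field are illustrative assumptions, not our actual schema: alerts that implicate the same upstream asset within a short window collapse into one incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical alert record; field names are illustrative, not our real schema.
@dataclass
class Alert:
    asset: str        # the table the monitor fired on
    root_asset: str   # nearest upstream asset implicated by lineage
    fired_at: datetime

@dataclass
class Incident:
    root_asset: str
    alerts: list = field(default_factory=list)

def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts sharing a root asset that fire within `window` into incidents."""
    incidents = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        inc = incidents.get(alert.root_asset)
        if inc and alert.fired_at - inc.alerts[-1].fired_at <= window:
            inc.alerts.append(alert)  # same root cause, another symptom
        else:
            incidents[alert.root_asset] = Incident(alert.root_asset, [alert])
    return list(incidents.values())
```

With this shape, twelve symptom alerts sharing two upstream root assets surface as two incidents rather than twelve pages.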

What We Learned About Auto-Remediation

  • Retry is the easy case. If a pipeline failed due to a transient error, retrying is usually safe. Maybe 20-25% of incidents fall into this category.
  • Backfill is dangerous. Automatically backfilling data sounds simple until the backfill overwrites corrected data or triggers expensive full-table recomputation.
  • Default values are almost never right. Replacing nulls with defaults is a business logic decision, not a data engineering decision.

Our current approach: the agent can auto-remediate retries (with configurable limits). Everything else gets a recommended action with a one-click approval. The human decides. The agent executes.
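A minimal sketch of the retry path under that policy: retry only on errors classified as transient, with a configurable attempt budget and exponential backoff, and escalate once the budget is exhausted. The function name and the choice of `TimeoutError` as the transient class are assumptions for illustration.

```python
import time

def auto_retry(run, max_attempts=3, base_delay=1.0, transient=(TimeoutError,)):
    """Retry `run` on transient errors only, with exponential backoff.

    Exhausting the budget re-raises so the incident escalates to a human
    with a recommended action, rather than looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run()
        except transient:
            if attempt == max_attempts:
                raise  # out of retry budget: escalate, do not auto-remediate
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The key design choice is the explicit `transient` allowlist: anything not on it fails immediately and goes to the approval queue, which keeps backfills and default-value fixes out of the automated path.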

Honest Failures

  • Seasonality blindness. Our anomaly detection initially flagged every weekend dip as an anomaly. Adding day-of-week patterns helped, but we still get false positives around holidays and campaigns.
  • The cold start problem. The agent needs historical data to establish baselines. For new tables, the initial monitoring period generates unreliable alerts.
  • Custom business rules. Every company has data quality rules specific to their business. The agent can learn these from examples, but defining them is still manual work.
  • Quality score calibration. Our quality scores are useful for relative comparison but the absolute numbers are not yet meaningful.
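The day-of-week fix mentioned above can be sketched as a seasonal baseline: score today's observation against the historical mean and standard deviation for the same weekday, instead of a single global baseline. This is a simplified illustration, not our production detector, and it still has the holiday/campaign blind spot described above.

```python
from statistics import mean, stdev

def weekday_zscore(history, weekday, value):
    """Z-score `value` against past observations for the same weekday.

    `history` maps weekday (0=Monday .. 6=Sunday) to a list of past
    observations, e.g. daily row counts. A weekend dip stops looking
    anomalous because it is compared only against other weekends.
    """
    past = history[weekday]
    mu, sigma = mean(past), stdev(past)
    return (value - mu) / sigma if sigma else 0.0
```

A Saturday row count of 100 scores near zero if Saturdays historically sit around 100, even when weekdays run far higher.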

Where we are today: working prototypes, active design partner conversations, and a lot of honest lessons about what agents can and cannot do in production data environments.
