Why We Open-Sourced 14 Autonomous Data Engineering Agents
Trust requires transparency. Here is how we designed an autonomous agent swarm for production use, open source first.
By The Data Workers Team
Today we released the community edition of Data Workers: 14 autonomous agents for data engineering, open-sourced under Apache 2.0. This post explains why we made that decision, how the trust model works, and what we are looking for from the community.
What Ships in the Open
The community edition includes 14 agents covering the core data engineering lifecycle: incident debugging, quality monitoring, schema evolution, pipeline construction, data context and cataloging, governance and security, cost optimization, data migration, insights and analytics, streaming operations, orchestration coordination, connector management, observability, and usage intelligence.
In concrete terms, that is 202+ MCP tools, 15 catalog connectors (Snowflake, Databricks, BigQuery, Unity Catalog, Hive Metastore, and more), and 3,000+ passing tests. Every agent has its own MCP server. Every tool call is auditable.
A 15th agent for ML model monitoring ships as enterprise-only, alongside 35 additional enterprise connectors and features like PII detection middleware, tamper-evident audit trails, and OAuth 2.1 authentication. The community edition has no feature gates on the 14 agents it includes.
Why Open Source?
The short answer: because black-box agents and critical data infrastructure do not mix.
When an agent modifies your Airflow DAGs, evolves a schema in production, or recommends dropping an unused table that turns out to be consumed by a downstream team you did not know about, you need to understand exactly what logic drove that decision. You need to read the code. You need to audit the tool calls. You need to verify the reasoning.
That requirement is not compatible with a closed-source product. We considered offering a hosted-only service and rejected it. Data engineers are rightfully skeptical of autonomous systems they cannot inspect. We would be too. And inspectability is only the first problem with closed agents:
- Vendor lock-in compounds over time. Once an agent manages your pipeline configurations, incident response, and governance policies, switching costs become prohibitive. Your operational knowledge lives in a system you do not own.
- Customization hits walls. Every data environment is different. When a proprietary agent does not handle your specific migration pattern, you file a feature request and wait. With open source, you fix it yourself.
- Audit requirements grow. Regulated industries need to demonstrate exactly how autonomous systems make decisions. Reading the actual source code satisfies auditors in a way that vendor assurances do not.
- Incident response is blind. When a proprietary agent makes a bad decision at 2 AM, your on-call engineer cannot read the code to understand what happened.
Because it is Apache 2.0, your investment is protected even if we disappear tomorrow. Fork it, modify it, run it in production indefinitely. The license guarantees that.
The Trust Model: Read-Only by Default
Every agent in the swarm is designed to operate in read-only mode by default. Agents observe, diagnose, and recommend. They do not take write actions unless you explicitly opt in.
This is a deliberate architectural decision, not a temporary limitation. The trust model works in three tiers:
- Observe. Agents connect to your data stack, read metadata, trace lineage, and surface findings. No write access required.
- Recommend. Based on observations, agents propose specific actions: fix this query, evolve this schema, drop this unused table. Each recommendation includes the reasoning chain and the tool calls that produced it.
- Act (opt-in only). With explicit configuration, agents can execute approved action types autonomously. Human approval gates are available for every write operation. You control exactly how much autonomy each agent gets.
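As a rough sketch of how the Act tier's approval gating can work, consider the following. This is illustrative only: the `Recommendation` and `ApprovalGate` names, fields, and methods are hypothetical, not Data Workers' actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    """An agent's proposed write action, carried with its evidence."""
    action: str
    reasoning: list = field(default_factory=list)  # findings that produced it
    approved: bool = False

class ApprovalGate:
    """Holds write actions until a human (or configured policy) approves them."""
    def __init__(self):
        self.pending = []
        self.log = []

    def propose(self, rec):
        # Recommend tier: the action is queued, never executed directly.
        self.pending.append(rec)
        return rec

    def approve(self, rec):
        # Explicit human opt-in for this specific action.
        rec.approved = True

    def execute(self, rec, do):
        # Act tier: only approved actions run, and every run is logged.
        if not rec.approved:
            raise PermissionError("action not approved")
        self.log.append(rec.action)
        return do()
```

The point of the pattern is that the recommendation object carries its own evidence, so the approval decision and the audit log both reference the same reasoning chain.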
Every MCP tool in the system is tagged as either a READ or WRITE operation. Write tools are disabled by default and require explicit enablement per agent, per environment.
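A minimal sketch of what that READ/WRITE separation could look like in code. The decorator, registry, and `allow_writes` flag below are illustrative assumptions, not the project's actual API:

```python
from enum import Enum

class Access(Enum):
    READ = "read"
    WRITE = "write"

# Hypothetical registry: every tool declares its access tag at definition time.
TOOLS = {}

def tool(name, access):
    """Register a function as an agent tool with an explicit READ/WRITE tag."""
    def decorator(fn):
        TOOLS[name] = (access, fn)
        return fn
    return decorator

@tool("trace_lineage", Access.READ)
def trace_lineage(table):
    return f"lineage for {table}"

@tool("drop_table", Access.WRITE)
def drop_table(table):
    return f"dropped {table}"

def call_tool(name, *args, allow_writes=False):
    """WRITE tools run only when explicitly enabled for this agent/environment."""
    access, fn = TOOLS[name]
    if access is Access.WRITE and not allow_writes:
        raise PermissionError(f"{name} is a WRITE tool and writes are disabled")
    return fn(*args)
```

Because the tag lives on the tool itself rather than in caller code, a per-agent, per-environment configuration can flip `allow_writes` without touching any tool implementation.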
What This Looks Like in Practice
Consider a production incident: a key pipeline fails at 2 AM. Without Data Workers, an on-call engineer wakes up, checks Airflow logs, traces the failure upstream through dbt, queries Snowflake to find the root cause, and manually applies a fix. That typically takes 30 to 90 minutes on a good night.
With the community edition, the incident agent detects the failure, traces lineage across tools, identifies the root cause, and presents a diagnosis with full evidence. The agent shows you exactly what it found, what it checked, and what it recommends, and it is designed to compress that diagnosis from an hour to minutes.
The community edition tells you the root cause. The Pro tier lets the agent automatically apply the fix and rerun the pipeline, with approval gates you configure. That is the upgrade path: not gated features on the same work, but additional autonomy on top of full transparency.
How We Built It
The architecture is MCP-first. Each agent runs its own MCP server, exposing tools that other agents and external clients can call. Agents coordinate through shared context rather than a centralized orchestrator.
- 14 specialized agents, each focused on one domain of data engineering
- 202+ MCP tools across all agents, with clear READ/WRITE separation
- 15 catalog connectors for cross-platform data discovery
- Factory-pattern infrastructure that auto-detects real services from environment variables and falls back to in-memory stubs for local development
- 3,000+ tests covering tool functionality, agent coordination, and edge cases
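The factory-pattern fallback mentioned above can be sketched roughly like this. The class names and the `SNOWFLAKE_ACCOUNT` variable are illustrative assumptions rather than the actual connector interface:

```python
import os

class InMemoryCatalog:
    """Local stub: lets agents run with no external service configured."""
    def __init__(self):
        self._tables = {}

    def register(self, name, schema):
        self._tables[name] = schema

    def get(self, name):
        return self._tables.get(name)

class SnowflakeCatalog:
    """Placeholder for a real connector; would hold a live connection."""
    def __init__(self, account):
        self.account = account

def make_catalog(env=None):
    """Return a real connector when credentials are present, else the stub."""
    env = os.environ if env is None else env
    account = env.get("SNOWFLAKE_ACCOUNT")
    if account:
        return SnowflakeCatalog(account)
    return InMemoryCatalog()
```

The same factory runs unchanged in CI (stub) and production (real connector), which is what makes a 3,000-test suite practical without provisioning live warehouses.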
We spent 12 months on research and development before this release. The agent designs are grounded in real data engineering workflows, not hypothetical use cases. That said, we are early stage and honest about it. These agents are designed to handle production scenarios, but they have not yet been battle-tested across hundreds of environments. That is what the next phase is for.
The Business Model
The community edition is free and fully functional for the 14 agents it includes. The Pro and Enterprise tiers add operational autonomy (write actions, automated remediation), the 15th ML monitoring agent, 35 additional enterprise connectors, PII detection, tamper-evident audit logs, OAuth 2.1 authentication, and dedicated support.
The line is straightforward: transparency and diagnosis are free. Autonomy and enterprise security are paid.
We Are Looking for Design Partners
We are looking for design partners to validate these agents in real environments. If you run a data stack with more than a few pipelines and have experienced the 2 AM incident, the schema change that broke downstream consumers, or the warehouse bill that quietly doubled, we want to work with you.
What design partners get: direct access to the engineering team, influence on the roadmap, early access to Pro features during the validation period, and the knowledge that the agents are being shaped by your real-world requirements.
What we get: honest feedback on what works, what does not, and what we missed.
Clone the repo: github.com/DataWorkersProject/dataworkers-claw-community
Join the community: discord.com/invite/b8DR5J53
Read the docs and pricing: dataworkers.io
We built this in the open because we believe that is the only way autonomous agents earn trust in production. Read the code. Tell us what is wrong. Help us make it better.