What Is a Data Contract? Schema + SLA as Code
Written by The Data Workers Team — 14 autonomous agents shipping production data infrastructure since 2026.
Technically reviewed by the Data Workers engineering team.
A data contract is a formal agreement between a data producer and consumer that specifies the schema, SLA, and semantics of a data product. Contracts are versioned in git, enforced in CI, and monitored in production — giving consumers confidence that upstream changes will not break their dashboards without warning.
Data contracts are the most important idea in data engineering in the last five years. They solve the root cause of most pipeline incidents: silent schema changes that break downstream consumers. This guide walks through what a contract is, how it works, and why every serious data stack should adopt them.
The term "data contract" was coined around 2022 by practitioners at Convoy, GoCardless, and PayPal who were tired of the same schema-incident pattern happening every week. The idea borrowed directly from API design: service-to-service communication uses versioned API contracts, so data-to-data communication should too. Once the framing clicked, the pattern spread across the industry within about two years.
What a Contract Specifies
A data contract names the producer and consumer, lists the schema (columns, types, nullability, descriptions), sets the SLA (freshness, volume, availability), and documents the semantics (what each column means, what filters apply, what business rules hold). It is the complete interface between two teams.
| Section | Example |
|---|---|
| Producer | growth-team |
| Consumer | finance-mart |
| Table | fct_orders |
| Schema | order_id STRING, revenue NUMERIC(18,2), order_date DATE |
| SLA | Refresh every 15 min, 99.5% uptime |
| Semantics | Revenue excludes refunds; orders from test accounts filtered |
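As code, the table above might look like the following YAML. This is a hypothetical sketch rather than a standard format — field names and structure vary by tool — but it captures the four sections a contract specifies:

```yaml
# Hypothetical data contract in YAML; exact fields vary by tooling.
contract:
  version: 1.2.0
  producer: growth-team
  consumer: finance-mart
  table: fct_orders
  schema:
    - name: order_id
      type: STRING
      nullable: false
    - name: revenue
      type: NUMERIC(18,2)
      description: Order revenue, excluding refunds
    - name: order_date
      type: DATE
  sla:
    freshness: 15m
    uptime: 99.5%
  semantics:
    - Revenue excludes refunds
    - Orders from test accounts are filtered out
```

Because it is plain text in git, this file can be reviewed in a PR, diffed across versions, and parsed by CI.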
Why Contracts Exist
Before contracts, upstream teams changed schemas freely and downstream teams found out when their dashboards broke. A rename, a type change, or a dropped column cascaded into incidents. Contracts flip the relationship: producers commit to stability, and breaking changes require explicit coordination.
Teams that adopt contracts typically stop firefighting schema incidents within a quarter, and the drop in on-call volume is often dramatic and immediate.
The coordination cost of a contract might sound high, but the alternative is worse. Without contracts, every schema change is an implicit coordination problem: producer ships, consumer breaks, team pages someone, everyone jumps in Slack, root cause analysis reveals nobody knew the change was coming. Contracts move that coordination forward in time, where it is cheap, instead of waiting until production, where it is expensive.
Contract Lifecycle
A contract has a full lifecycle, not just a definition. Teams that forget about later stages (evolution, retirement, enforcement drift) end up with contracts that ship once and rot. The five stages below should each have tooling and process, or the program loses momentum within a quarter.
- Define — write the contract as code (YAML, Protobuf, Avro)
- Version — store in git with semantic versioning
- Enforce in CI — fail PRs that violate the contract
- Monitor — alert on runtime violations
- Evolve — coordinate breaking changes via version bumps
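The "enforce in CI" step can be sketched as a small script that diffs the produced schema against the contracted one and fails the build on drift. This is a minimal illustration, not a specific tool's implementation; the table and column names are hypothetical, and a real pipeline would fetch the produced schema from the warehouse's information schema.

```python
# Minimal CI check: fail the build if the produced schema drifts
# from the contracted schema. All names here are illustrative.
import sys

# Schema declared in the contract (e.g. parsed from a YAML file).
CONTRACTED = {
    "order_id": "STRING",
    "revenue": "NUMERIC(18,2)",
    "order_date": "DATE",
}

# Schema actually produced by the model (in a real pipeline,
# queried from the warehouse's information_schema).
PRODUCED = {
    "order_id": "STRING",
    "revenue": "NUMERIC(18,2)",
    "order_date": "DATE",
}

def check_schema(contracted, produced):
    """Return a list of human-readable contract violations."""
    violations = []
    for col, dtype in contracted.items():
        if col not in produced:
            violations.append(f"missing column: {col}")
        elif produced[col] != dtype:
            violations.append(f"type change on {col}: {dtype} -> {produced[col]}")
    for col in produced:
        if col not in contracted:
            violations.append(f"undeclared column: {col}")
    return violations

if __name__ == "__main__":
    problems = check_schema(CONTRACTED, PRODUCED)
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)  # non-zero exit fails the CI job
```

A dropped column, a rename, or a type change all surface as a failed PR check instead of a broken dashboard.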
Contract Formats
Common formats include Protobuf (strong typing, binary efficient, schema registry support), Avro (JSON-friendly, schema evolution rules), and custom YAML (human-readable, tool-agnostic). Protobuf is the default for streaming and event-driven systems; YAML and dbt contracts dominate warehouse-side definitions.
The format matters less than the enforcement. A contract written in the fanciest format on the planet does nothing if CI does not check it. Pick whatever format your team can write, review, and automate on. For most warehouse-centric teams, that means dbt contracts in YAML — they live alongside the dbt models that produce the data, review happens in the same PR, and dbt itself enforces them on compile.
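As a concrete sketch, a dbt contract for the fct_orders example might look like this (dbt 1.5+ syntax; the column list is abridged):

```yaml
# models/marts/schema.yml — dbt enforces this contract on compile.
models:
  - name: fct_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
        constraints:
          - type: not_null
      - name: revenue
        data_type: numeric(18,2)
      - name: order_date
        data_type: date
```

With `enforced: true`, dbt fails the build if the model's output does not match the declared columns and types.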
Enforcement in Practice
A contract is only valuable when it is enforced. CI should check that the produced schema matches the contract. Runtime monitors should alert on violations. Catalog tools should surface contracts to consumers. Without enforcement, contracts rot into outdated docs within months.
Enforcement has two layers — static and runtime. Static enforcement happens in CI, comparing the produced schema against the contract on every PR. Runtime enforcement happens in production, comparing the actual data against the contract's constraints (type, nullability, ranges, freshness) on every refresh. Both are needed: static catches schema drift before it ships, runtime catches data drift that slipped past static checks. Teams that implement only one eventually find gaps the other would have caught.
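The runtime layer can be sketched as a monitor that runs after each refresh, checking the data itself against the contract's constraints. This is a minimal illustration under assumed thresholds — the 15-minute freshness SLA mirrors the example contract above, and the column names and range rule are hypothetical:

```python
# Runtime contract monitor sketch: check freshness, nullability, and
# ranges after each refresh. SLA values and names are illustrative.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=15)  # from the contract's SLA section

def check_refresh(rows, last_refreshed_at, now=None):
    """Return runtime contract violations for one refresh."""
    now = now or datetime.now(timezone.utc)
    violations = []
    # Freshness: data must have refreshed within the SLA window.
    if now - last_refreshed_at > FRESHNESS_SLA:
        violations.append("freshness SLA breached")
    # Nullability: order_id is declared NOT NULL in the contract.
    if any(row.get("order_id") is None for row in rows):
        violations.append("null order_id violates contract")
    # Range: revenue must be non-negative per the semantics section.
    if any(row.get("revenue", 0) < 0 for row in rows):
        violations.append("negative revenue out of declared range")
    return violations
```

In practice the violations list would feed an alerting tool; the point is that these checks run against the actual data, catching drift that a CI-time schema comparison cannot see.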
For related topics see how to implement data contracts and how to handle schema evolution.
Contracts and AI
Contracts also help AI assistants. A well-defined contract tells the AI exactly what columns exist, what they mean, and how to use them — no more hallucinated joins or misapplied filters. Data Workers governance agents expose contracts as MCP tools so Claude, Cursor, and ChatGPT can query against guaranteed schemas.
Book a demo to see contract-aware AI and autonomous contract enforcement.
Real-World Examples
A fintech has contracts covering every table that feeds the finance dashboard: fct_transactions, dim_customer, fct_daily_balance. Each contract is reviewed quarterly and enforced on every PR. A SaaS company has contracts on the 20 tables that feed investor-facing metrics — MRR, churn, cohort retention — and nothing else yet. They are expanding coverage over time. A healthcare company has strict contracts on every table containing PHI, with runtime monitors alerting on any field outside its declared range.
When You Need It
You need contracts when schema incidents become a recurring pattern and on-call is suffering. Signs include: "the dashboard broke again" more than once a quarter, PR reviews where nobody knows who depends on a column, and finance calling because last month's MRR shifted without anyone understanding why. Any one of these means contracts would pay back quickly.
Common Misconceptions
Contracts are not forever-frozen schemas — they are versioned agreements that can evolve through explicit releases. Contracts are not bureaucracy — good tooling makes enforcement automatic, with no extra meetings required. And contracts do not require a schema registry — many teams successfully run contracts on just dbt plus CI, and only add a registry if they move into streaming.
A data contract is the formal interface between data producer and consumer. It specifies schema, SLA, and semantics, lives in git, and is enforced in CI and production. Adopt contracts and schema incidents stop waking you up at 3am. They are the single highest-leverage pattern in modern data engineering.
See Data Workers in action
15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.
Book a Demo
Related Resources
- What is Data Observability? The Data Engineer's Complete Guide — Data observability provides visibility into data health across your stack. This guide covers the five pillars, tool landscape, and how AI…
- Meta Data Meaning: Definition, Examples, and Why It Matters — Plain-language definition of meta data with examples and use cases for analysts, engineers, auditors, and AI agents.
- What Is Data Governance With Example: A Practical Guide — Real-world data governance examples from healthcare PHI, banking BCBS 239, and ecommerce GDPR with shared design principles.
- What Is Data Modernization? A 2026 Strategy Guide — Strategy guide covering the four phases of data modernization, common pitfalls, and how to make data AI-ready in 2026.
- What Is a Data Domain? Definition and Examples for Data Mesh — Guide to identifying data domains, using them in data mesh, and applying domain ownership in centralized stacks.
- What Is Data Transparency? Definition and Best Practices — Guide to data transparency including the five characteristics of transparent systems and how AI-native catalogs make transparency automatic.
- What Is Spatial Data? Definition, Types, and Examples — Spatial data primer covering vector vs raster types, common formats, spatial queries in modern warehouses, and quality issues.
- What Is Stale Data? Definition, Detection, and Prevention — Guide to identifying, detecting, and preventing stale data in pipelines with SLA contracts and active monitoring strategies.
- What Is Data Enablement? Definition and Strategy Guide — Strategy guide for data enablement programs covering access, literacy, trust, and tooling pillars.
- What Is a Data Pipeline? Complete 2026 Guide — Defines data pipelines and walks through the three stages, batch vs streaming, and modern tooling.
- What Is a Data Warehouse? Cloud Warehouse Guide — Explains what a data warehouse is, how cloud warehouses changed the category, and the modern platform choices.
- What Is a Data Lake? Modern Lakehouse Guide — Explains data lakes, lake vs warehouse tradeoffs, and the lakehouse evolution with Iceberg and Delta.
Explore Topic Clusters
- Data Governance: The Complete Guide — Policies, access controls, PII, and compliance at scale.
- Data Catalog: The Complete Guide — Discovery, metadata, lineage, and the modern catalog stack.
- Data Lineage: The Complete Guide — Column-level lineage, impact analysis, and observability.
- Data Quality: The Complete Guide — Tests, SLAs, anomaly detection, and data reliability engineering.
- AI Data Engineering: The Complete Guide — LLMs, agents, and autonomous workflows across the data stack.
- MCP for Data: The Complete Guide — Model Context Protocol servers, tools, and agent integration.
- Data Mesh & Data Fabric: The Complete Guide — Federated ownership, domain-oriented architecture, and interop.
- Open-Source Data Stack: The Complete Guide — dbt, Airflow, Iceberg, DuckDB, and the modern OSS toolkit.