
How to Test Data Pipelines: Schema, Data, Integration


Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


To test data pipelines: write schema tests (uniqueness, not-null, foreign keys), data tests (business rules, row counts, anomalies), and integration tests (end-to-end flow). Run them in CI on every pull request and in production after every run. Tooling like dbt tests, Great Expectations, and Soda makes this cheap — the hard part is discipline.

Pipelines without tests are production incidents waiting to happen. This guide walks through the three test categories every pipeline needs and the CI patterns that catch regressions before they hit customer dashboards.

The Three Categories of Pipeline Tests

Good pipeline testing has three layers: schema tests (structural), data tests (semantic), and integration tests (flow). Each catches a different class of bug. Skipping any one creates blind spots that will eventually break production.

The three layers should run in different places. Schema tests run on every PR (cheap, fast). Data tests run on every production pipeline run (catch real-world drift). Integration tests run nightly or on demand (heavier, slower, catch cross-model bugs). Separating them by frequency and location keeps CI fast while still catching bugs at every layer.

  • Schema tests — catch structural drift (example: primary key must be unique)
  • Data tests — catch semantic correctness bugs (example: revenue >= 0, dates in order)
  • Integration tests — catch end-to-end flow bugs (example: raw → staging → mart produces N rows)
  • Anomaly tests — catch statistical drift (example: row count within historical range)
  • Performance tests — catch runtime regressions (example: model under 10 min)

Schema Tests: The Baseline

Every mart table needs schema tests: a uniqueness test on the primary key, not-null tests on required columns, and referential integrity tests on foreign keys. dbt ships these for free. Add them as you build each model — not after the first bug.

Schema tests catch roughly 60% of pipeline regressions for about five minutes of setup per model. Few engineering practices return as much for so little effort.

dbt also ships accepted_values tests for categorical columns, which catch invalid category values silently introduced by source system changes. Combine uniqueness, not-null, foreign-key, and accepted_values tests on any mart table and you have eliminated ~75% of the schema-drift bug class with minimal effort.
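In a dbt project these four checks are declared in YAML, but the underlying logic is simple: each test is a query that should return zero rows. A minimal Python-plus-SQLite sketch of the same checks, using hypothetical `orders` and `customers` tables for illustration:

```python
import sqlite3

# Hypothetical mart tables for illustration; dbt would declare these
# checks in a schema.yml file rather than as imperative code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, status TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'active'), (2, 'churned');
    INSERT INTO orders VALUES (10, 1), (11, 2);
""")

def failing_rows(sql):
    """A test passes when its query returns zero rows (dbt's convention)."""
    return conn.execute(sql).fetchall()

# Uniqueness: the primary key must not repeat.
assert not failing_rows(
    "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1")

# Not-null: required columns must be populated.
assert not failing_rows("SELECT * FROM orders WHERE customer_id IS NULL")

# Referential integrity: every foreign key resolves to a parent row.
assert not failing_rows("""
    SELECT o.* FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL""")

# accepted_values: categorical column stays inside a known set.
assert not failing_rows(
    "SELECT * FROM customers WHERE status NOT IN ('active', 'churned')")
```

The zero-rows convention is worth internalizing: it makes every test, however complex, reducible to "this query found no violations."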

Data Tests: Business Rules

Data tests encode business rules that cannot be inferred from schema alone: revenue must be non-negative, order_date must be before ship_date, customer status must be in a known set. Write these as SQL assertions and run them alongside schema tests.

The highest-value business rule tests are reconciliation checks: does the sum of line-item revenue match the order total? Does the count of active users in the fact table match the count from the source system? These catch silent correctness bugs that no schema test would find, and they are worth writing for any table that feeds financial or executive dashboards.

  • Range checks — values within expected min/max
  • Set checks — categorical values match a known list
  • Cross-field checks — start_date < end_date
  • Aggregate checks — daily revenue roughly matches source
  • Freshness checks — data newer than SLA
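The reconciliation and cross-field checks above can be sketched as plain SQL assertions. This example uses hypothetical `orders` and `order_lines` tables in SQLite; the same queries would run as dbt singular tests against a warehouse:

```python
import sqlite3

# Hypothetical order data; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_total REAL,
                         order_date TEXT, ship_date TEXT);
    CREATE TABLE order_lines (order_id INTEGER, line_revenue REAL);
    INSERT INTO orders VALUES (1, 30.0, '2026-01-01', '2026-01-03');
    INSERT INTO order_lines VALUES (1, 10.0), (1, 20.0);
""")

# Reconciliation: line-item revenue must sum to the order total
# (with a small tolerance for floating-point rounding).
mismatches = conn.execute("""
    SELECT o.order_id
    FROM orders o
    JOIN (SELECT order_id, SUM(line_revenue) AS total
          FROM order_lines GROUP BY order_id) l USING (order_id)
    WHERE ABS(o.order_total - l.total) > 0.01
""").fetchall()
assert not mismatches, f"order totals out of sync: {mismatches}"

# Cross-field rule: an order cannot ship before it is placed.
bad_dates = conn.execute(
    "SELECT order_id FROM orders WHERE ship_date < order_date").fetchall()
assert not bad_dates, f"shipped before ordered: {bad_dates}"
```

Note the tolerance in the reconciliation query: exact equality on floating-point revenue sums produces flaky failures, so compare within an epsilon.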

Integration Tests: End-to-End Flow

Integration tests run the entire pipeline against a fixed sample of data and check that the final output matches expectations. They catch bugs that slip through per-model tests — joins with wrong cardinality, silent aggregation errors, misapplied filters. Run them in CI with a seeded dataset and zero-copy clone.
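A toy version of that flow, with the pipeline reduced to two SQL transformations over seeded data (all names are illustrative; in practice the "pipeline" would be your dbt project running against a warehouse clone):

```python
import sqlite3

# Seed a fixed sample of raw data so the expected output is known exactly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id INTEGER, amount REAL, status TEXT);
    INSERT INTO raw_events VALUES (1, 10.0, 'ok'), (1, 5.0, 'ok'),
                                  (2, 7.0, 'ok'), (3, 99.0, 'void');
""")

def run_pipeline(conn):
    # Staging: filter out voided events.
    conn.execute("""CREATE TABLE stg_events AS
                    SELECT * FROM raw_events WHERE status = 'ok'""")
    # Mart: one row per user with total spend.
    conn.execute("""CREATE TABLE mart_user_spend AS
                    SELECT user_id, SUM(amount) AS total_spend
                    FROM stg_events GROUP BY user_id""")

run_pipeline(conn)

# Assert on the final output, not the intermediate steps: this is what
# catches wrong-cardinality joins and misapplied filters.
rows = conn.execute(
    "SELECT user_id, total_spend FROM mart_user_spend ORDER BY user_id"
).fetchall()
assert rows == [(1, 15.0), (2, 7.0)], rows
```

Because the seed data is fixed, the expected mart output can be written down literally, so any change in row count or values fails the test immediately.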

For related topics see how to debug a data pipeline and how to implement data quality.

CI/CD Integration

Tests should run on every pull request, blocking merges when they fail. The typical flow: PR opened → CI creates a warehouse clone → dbt build runs models → dbt test runs tests → results posted back to PR. If any test fails, merge is blocked until fixed.

For CI setup patterns see how to version a data warehouse.

Production Testing

CI tests are necessary but not sufficient. Production pipelines must also run tests on every scheduled run and alert on failures. Data Workers pipeline agents automate this end to end — running tests, diagnosing failures, writing fix PRs, and rolling back bad deploys.

Tests should also be monitored for quality: a test that never fails is either perfectly specified or completely dead weight. Track test pass rates over time and review tests that have not fired in 6 months — either delete them or tighten their thresholds so they catch real regressions.
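A review like that is easy to automate from test-run history. This sketch assumes a simple list of (test name, run date, passed) records; in practice the history would come from dbt artifacts or a tool like Elementary:

```python
from datetime import date, timedelta

# Hypothetical test-result history: (test_name, run_date, passed).
history = [
    ("unique_orders_pk", date(2026, 1, 1), True),
    ("unique_orders_pk", date(2026, 6, 1), True),
    ("revenue_non_negative", date(2026, 5, 20), False),
    ("revenue_non_negative", date(2026, 6, 1), True),
]

def stale_tests(history, today, window_days=180):
    """Tests that have not failed once inside the review window:
    candidates for deletion or for tightening their thresholds."""
    cutoff = today - timedelta(days=window_days)
    last_failure = {}
    for name, run_date, passed in history:
        if not passed:
            last_failure[name] = max(last_failure.get(name, run_date), run_date)
    names = {name for name, _, _ in history}
    return sorted(n for n in names if last_failure.get(n, cutoff) <= cutoff)

print(stale_tests(history, today=date(2026, 6, 15)))
```

Here `unique_orders_pk` is flagged because it has never failed, while `revenue_non_negative` failed recently and is clearly still doing work.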

Common Mistakes

The biggest mistake is writing one giant test that checks everything. A test that combines schema, data, and business rule assertions produces cryptic failures that are hard to diagnose. Split tests into focused, single-purpose checks so failures tell you exactly what went wrong.

The second biggest is testing the wrong layer. Testing staging tables catches upstream drift; testing marts catches business logic bugs. Skipping either layer leaves a blind spot. The teams with the fewest production incidents test at every layer, not just the final output.

Tools You Will Need

  • dbt tests — schema tests, data tests, generic tests
  • dbt_expectations — Great Expectations-style assertions in dbt
  • Great Expectations — standalone quality framework
  • Soda Core / Soda Cloud — no-code quality rules
  • Datafold data-diff — row-level PR diffs
  • Elementary — dbt test observability and anomaly detection

Test Coverage Targets

Aim for 100% of mart tables to have primary key uniqueness, not-null on required columns, and referential integrity tests. Aim for 80%+ of mart tables to have business rule tests. Track coverage in a dashboard so gaps are visible. Coverage targets create accountability and prevent tests from being skipped as "not worth it."
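Coverage against those targets can be computed from a mapping of tables to the test types they carry. The mapping below is hypothetical; in a dbt project it could be derived from `manifest.json`:

```python
# Hypothetical mapping of mart tables to their test types.
tests_by_table = {
    "mart_orders": {"unique", "not_null", "relationships", "business_rule"},
    "mart_customers": {"unique", "not_null"},
    "mart_revenue": set(),
}

# The baseline every mart table should meet.
REQUIRED_SCHEMA_TESTS = {"unique", "not_null", "relationships"}

def coverage(tests_by_table):
    n = len(tests_by_table)
    # A table counts as covered only if it has the full required set.
    schema_ok = sum(REQUIRED_SCHEMA_TESTS <= t for t in tests_by_table.values())
    rules_ok = sum("business_rule" in t for t in tests_by_table.values())
    return {"schema_pct": 100 * schema_ok / n, "rules_pct": 100 * rules_ok / n}

print(coverage(tests_by_table))
```

Requiring the full set per table (rather than counting individual tests) keeps the metric honest: a table with only a uniqueness test does not count toward the 100% schema-test target.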

Coverage is a floor, not a ceiling. A table with 100% test coverage can still ship bad data if the tests are the wrong tests. Review test design quarterly with a senior engineer to catch patterns that look like coverage but miss real risks.

Book a demo to see autonomous pipeline testing in action.

Testing data pipelines is a three-layer discipline: schema, data, and integration tests, run in both CI and production. Use dbt or Great Expectations for execution and tie everything to ownership and alerting. The pipelines that never wake you up at 3am are the ones with honest test coverage.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
