
How to Test Data Pipelines: Schema, Data, Integration


Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


To test data pipelines: write schema tests (uniqueness, not-null, foreign keys), data tests (business rules, row counts, anomalies), and integration tests (end-to-end flow). Run them in CI on every pull request and in production after every run. Tooling like dbt tests, Great Expectations, and Soda makes this cheap — the hard part is discipline.

Pipelines without tests are production incidents waiting to happen. This guide walks through the three test categories every pipeline needs and the CI patterns that catch regressions before they hit customer dashboards.

The Three Categories of Pipeline Tests

Good pipeline testing has three layers: schema tests (structural), data tests (semantic), and integration tests (flow). Each catches a different class of bug. Skipping any one creates blind spots that will eventually break production.

The three layers should run in different places. Schema tests run on every PR (cheap, fast). Data tests run on every production pipeline run (catch real-world drift). Integration tests run nightly or on demand (heavier, slower, catch cross-model bugs). Separating them by frequency and location keeps CI fast while still catching bugs at every layer.

  • Schema tests — catch structural drift (example: primary key must be unique)
  • Data tests — catch semantic correctness bugs (example: revenue >= 0, dates in order)
  • Integration tests — catch end-to-end flow bugs (example: raw → staging → mart produces N rows)
  • Anomaly tests — catch statistical drift (example: row count within historical range)
  • Performance tests — catch runtime regressions (example: model under 10 min)

Schema Tests: The Baseline

Every mart table needs schema tests: a uniqueness test on the primary key, not-null tests on required columns, and referential integrity tests on foreign keys. dbt ships these for free. Add them as you build each model — not after the first bug.

Schema tests catch roughly 60% of pipeline regressions for about five minutes of setup per model. Few engineering practices return as much for so little effort.

dbt also ships accepted_values tests for categorical columns, which catch invalid category values silently introduced by source system changes. Combine uniqueness, not-null, foreign-key, and accepted_values tests on any mart table and you have eliminated ~75% of the schema-drift bug class with minimal effort.
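In a dbt project these four checks are declared in YAML, but the underlying logic is simple: each test is a query that should return zero rows. A minimal Python-plus-SQLite sketch of the same checks, using hypothetical `orders` and `customers` tables for illustration:

```python
import sqlite3

# Hypothetical mart tables for illustration; dbt would declare these
# checks in a schema.yml file rather than as imperative code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, status TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'active'), (2, 'churned');
    INSERT INTO orders VALUES (10, 1), (11, 2);
""")

def failing_rows(sql):
    """A test passes when its query returns zero rows (dbt's convention)."""
    return conn.execute(sql).fetchall()

# Uniqueness: the primary key must not repeat.
assert not failing_rows(
    "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1")

# Not-null: required columns must be populated.
assert not failing_rows("SELECT * FROM orders WHERE customer_id IS NULL")

# Referential integrity: every foreign key resolves to a parent row.
assert not failing_rows("""
    SELECT o.* FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL""")

# accepted_values: categorical column stays inside a known set.
assert not failing_rows(
    "SELECT * FROM customers WHERE status NOT IN ('active', 'churned')")
```

The zero-rows convention is worth internalizing: it makes every test, however complex, reducible to "this query found no violations."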

Data Tests: Business Rules

Data tests encode business rules that cannot be inferred from schema alone: revenue must be non-negative, order_date must be before ship_date, customer status must be in a known set. Write these as SQL assertions and run them alongside schema tests.

The highest-value business rule tests are reconciliation checks: does the sum of line-item revenue match the order total? Does the count of active users in the fact table match the count from the source system? These catch silent correctness bugs that no schema test would find, and they are worth writing for any table that feeds financial or executive dashboards.

  • Range checks — values within expected min/max
  • Set checks — categorical values match a known list
  • Cross-field checks — start_date < end_date
  • Aggregate checks — daily revenue roughly matches source
  • Freshness checks — data newer than SLA
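The reconciliation and cross-field checks above can be sketched as plain SQL assertions. This example uses hypothetical `orders` and `order_lines` tables in SQLite; the same queries would run as dbt singular tests against a warehouse:

```python
import sqlite3

# Hypothetical order data; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_total REAL,
                         order_date TEXT, ship_date TEXT);
    CREATE TABLE order_lines (order_id INTEGER, line_revenue REAL);
    INSERT INTO orders VALUES (1, 30.0, '2026-01-01', '2026-01-03');
    INSERT INTO order_lines VALUES (1, 10.0), (1, 20.0);
""")

# Reconciliation: line-item revenue must sum to the order total
# (with a small tolerance for floating-point rounding).
mismatches = conn.execute("""
    SELECT o.order_id
    FROM orders o
    JOIN (SELECT order_id, SUM(line_revenue) AS total
          FROM order_lines GROUP BY order_id) l USING (order_id)
    WHERE ABS(o.order_total - l.total) > 0.01
""").fetchall()
assert not mismatches, f"order totals out of sync: {mismatches}"

# Cross-field rule: an order cannot ship before it is placed.
bad_dates = conn.execute(
    "SELECT order_id FROM orders WHERE ship_date < order_date").fetchall()
assert not bad_dates, f"shipped before ordered: {bad_dates}"
```

Note the tolerance in the reconciliation query: exact equality on floating-point revenue sums produces flaky failures, so compare within an epsilon.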

Integration Tests: End-to-End Flow

Integration tests run the entire pipeline against a fixed sample of data and check that the final output matches expectations. They catch bugs that slip through per-model tests — joins with wrong cardinality, silent aggregation errors, misapplied filters. Run them in CI with a seeded dataset and zero-copy clone.
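A toy version of that flow, with the pipeline reduced to two SQL transformations over seeded data (all names are illustrative; in practice the "pipeline" would be your dbt project running against a warehouse clone):

```python
import sqlite3

# Seed a fixed sample of raw data so the expected output is known exactly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id INTEGER, amount REAL, status TEXT);
    INSERT INTO raw_events VALUES (1, 10.0, 'ok'), (1, 5.0, 'ok'),
                                  (2, 7.0, 'ok'), (3, 99.0, 'void');
""")

def run_pipeline(conn):
    # Staging: filter out voided events.
    conn.execute("""CREATE TABLE stg_events AS
                    SELECT * FROM raw_events WHERE status = 'ok'""")
    # Mart: one row per user with total spend.
    conn.execute("""CREATE TABLE mart_user_spend AS
                    SELECT user_id, SUM(amount) AS total_spend
                    FROM stg_events GROUP BY user_id""")

run_pipeline(conn)

# Assert on the final output, not the intermediate steps: this is what
# catches wrong-cardinality joins and misapplied filters.
rows = conn.execute(
    "SELECT user_id, total_spend FROM mart_user_spend ORDER BY user_id"
).fetchall()
assert rows == [(1, 15.0), (2, 7.0)], rows
```

Because the seed data is fixed, the expected mart output can be written down literally, so any change in row count or values fails the test immediately.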

For related topics see how to debug a data pipeline and how to implement data quality.

CI/CD Integration

Tests should run on every pull request, blocking merges when they fail. The typical flow: PR opened → CI creates a warehouse clone → dbt build runs models → dbt test runs tests → results posted back to PR. If any test fails, merge is blocked until fixed.

For CI setup patterns see how to version a data warehouse.

Production Testing

CI tests are necessary but not sufficient. Production pipelines must also run tests on every scheduled run and alert on failures. Data Workers pipeline agents automate this end to end — running tests, diagnosing failures, writing fix PRs, and rolling back bad deploys.

Tests should also be monitored for quality: a test that never fails is either perfectly specified or completely dead weight. Track test pass rates over time and review tests that have not fired in 6 months — either delete them or tighten their thresholds so they catch real regressions.
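A review like that is easy to automate from test-run history. This sketch assumes a simple list of (test name, run date, passed) records; in practice the history would come from dbt artifacts or a tool like Elementary:

```python
from datetime import date, timedelta

# Hypothetical test-result history: (test_name, run_date, passed).
history = [
    ("unique_orders_pk", date(2026, 1, 1), True),
    ("unique_orders_pk", date(2026, 6, 1), True),
    ("revenue_non_negative", date(2026, 5, 20), False),
    ("revenue_non_negative", date(2026, 6, 1), True),
]

def stale_tests(history, today, window_days=180):
    """Tests that have not failed once inside the review window:
    candidates for deletion or for tightening their thresholds."""
    cutoff = today - timedelta(days=window_days)
    last_failure = {}
    for name, run_date, passed in history:
        if not passed:
            last_failure[name] = max(last_failure.get(name, run_date), run_date)
    names = {name for name, _, _ in history}
    return sorted(n for n in names if last_failure.get(n, cutoff) <= cutoff)

print(stale_tests(history, today=date(2026, 6, 15)))
```

Here `unique_orders_pk` is flagged because it has never failed, while `revenue_non_negative` failed recently and is clearly still doing work.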

Common Mistakes

The biggest mistake is writing one giant test that checks everything. A test that combines schema, data, and business rule assertions produces cryptic failures that are hard to diagnose. Split tests into focused, single-purpose checks so failures tell you exactly what went wrong.

The second biggest is testing the wrong layer. Testing staging tables catches upstream drift; testing marts catches business logic bugs. Skipping either layer leaves a blind spot. The teams with the fewest production incidents test at every layer, not just the final output.

Tools You Will Need

  • dbt tests — schema tests, data tests, generic tests
  • dbt_expectations — Great Expectations-style assertions in dbt
  • Great Expectations — standalone quality framework
  • Soda Core / Soda Cloud — no-code quality rules
  • Datafold data-diff — row-level PR diffs
  • Elementary — dbt test observability and anomaly detection

Test Coverage Targets

Aim for 100% of mart tables to have primary key uniqueness, not-null on required columns, and referential integrity tests. Aim for 80%+ of mart tables to have business rule tests. Track coverage in a dashboard so gaps are visible. Coverage targets create accountability and prevent tests from being skipped as "not worth it."
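Coverage against those targets can be computed from a mapping of tables to the test types they carry. The mapping below is hypothetical; in a dbt project it could be derived from `manifest.json`:

```python
# Hypothetical mapping of mart tables to their test types.
tests_by_table = {
    "mart_orders": {"unique", "not_null", "relationships", "business_rule"},
    "mart_customers": {"unique", "not_null"},
    "mart_revenue": set(),
}

# The baseline every mart table should meet.
REQUIRED_SCHEMA_TESTS = {"unique", "not_null", "relationships"}

def coverage(tests_by_table):
    n = len(tests_by_table)
    # A table counts as covered only if it has the full required set.
    schema_ok = sum(REQUIRED_SCHEMA_TESTS <= t for t in tests_by_table.values())
    rules_ok = sum("business_rule" in t for t in tests_by_table.values())
    return {"schema_pct": 100 * schema_ok / n, "rules_pct": 100 * rules_ok / n}

print(coverage(tests_by_table))
```

Requiring the full set per table (rather than counting individual tests) keeps the metric honest: a table with only a uniqueness test does not count toward the 100% schema-test target.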

Coverage is a floor, not a ceiling. A table with 100% test coverage can still ship bad data if the tests are the wrong tests. Review test design quarterly with a senior engineer to catch patterns that look like coverage but miss real risks.

Book a demo to see autonomous pipeline testing in action.

Testing data pipelines is a three-layer discipline: schema, data, and integration tests, run in both CI and production. Use dbt or Great Expectations for execution and tie everything to ownership and alerting. The pipelines that never wake you up at 3am are the ones with honest test coverage.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
