What are data quality checks and data contracts, and how do you enforce them in a modern data stack?
Data quality checks assert that datasets meet defined expectations — completeness, uniqueness, referential integrity, value ranges — and fail the pipeline or alert when they do not. Data contracts are formal, version-controlled agreements between data producers and consumers specifying schema, semantics, and SLAs, preventing silent breaking changes from propagating downstream.
How to think about it
Bad data that silently reaches dashboards costs more to remediate than bad data caught at ingestion. Quality checks and contracts shift detection upstream, as close to the source as possible.
Data quality dimensions
| Dimension | Example check |
|---|---|
| Completeness | NOT NULL rate for required columns exceeds 99% |
| Uniqueness | No duplicate order_id values |
| Freshness | Latest created_at is within 2 hours of now |
| Referential integrity | Every customer_id in orders exists in customers |
| Distribution | Revenue per day within 3 standard deviations of 30-day average |
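The dimensions in the table can be expressed as small, tool-independent predicates. The following is a minimal sketch in plain Python, not any particular library's API: the function names, the sample rows, and the thresholds are all hypothetical and chosen only to mirror the table.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev


def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def duplicates(rows, column):
    """Values of `column` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[column]
        (dupes if value in seen else seen).add(value)
    return dupes


def is_fresh(rows, column, now, max_age):
    """True if the newest timestamp in `column` is within `max_age` of `now`."""
    return now - max(r[column] for r in rows) <= max_age


def orphans(child_rows, fk, parent_keys):
    """Foreign-key values in the child rows that have no matching parent key."""
    return {r[fk] for r in child_rows} - set(parent_keys)


def within_sigma(value, history, n_sigma=3.0):
    """True if `value` is within `n_sigma` standard deviations of the history mean."""
    return abs(value - mean(history)) <= n_sigma * stdev(history)


# Hypothetical sample data, for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": "o1", "customer_id": "c1", "created_at": now - timedelta(minutes=30)},
    {"order_id": "o2", "customer_id": "c2", "created_at": now - timedelta(hours=5)},
    {"order_id": "o2", "customer_id": "c9", "created_at": now - timedelta(hours=6)},
]
customer_ids = ["c1", "c2"]
daily_revenue = [100.0, 110.0, 90.0, 105.0, 95.0]

report = {
    "completeness": completeness(orders, "order_id") > 0.99,
    "uniqueness": not duplicates(orders, "order_id"),
    "freshness": is_fresh(orders, "created_at", now, timedelta(hours=2)),
    "referential_integrity": not orphans(orders, "customer_id", customer_ids),
    "distribution": within_sigma(102.0, daily_revenue),
}
print(report)
```

In practice these checks usually run as SQL inside the warehouse rather than over in-memory rows, but the logic of each dimension is the same.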
dbt tests — inline quality checks
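dbt’s built-in tests, together with those from the dbt-expectations package, are declared alongside the model in YAML and run after each model is built (for example, with dbt build):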
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: [pending, shipped, delivered, cancelled]
      - name: amount_usd
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
Failing tests can be configured to block deployment (severity: error) or emit warnings (severity: warn).
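The difference between the two severities can be illustrated with a small gate that separates blocking failures from warnings. This is a sketch only: it assumes the input is shaped like the results list in dbt's run_results.json artifact (entries with unique_id and status), and the test identifiers in the sample are hypothetical.

```python
# Statuses treated as deployment-blocking in this sketch.
BLOCKING_STATUSES = {"fail", "error"}


def gate(run_results):
    """Split result entries into (blocking failures, warnings) by status."""
    blocking, warnings = [], []
    for result in run_results.get("results", []):
        status = result.get("status")
        if status in BLOCKING_STATUSES:
            blocking.append(result["unique_id"])
        elif status == "warn":
            warnings.append(result["unique_id"])
    return blocking, warnings


# Hypothetical sample in the assumed shape; not real dbt output.
sample = {
    "results": [
        {"unique_id": "test.shop.unique_stg_orders_order_id", "status": "pass"},
        {"unique_id": "test.shop.accepted_values_stg_orders_status", "status": "warn"},
        {"unique_id": "test.shop.amount_usd_between", "status": "fail"},
    ]
}

blocking, warnings = gate(sample)
exit_code = 1 if blocking else 0  # a non-zero exit code stops the deployment step
print(f"blocking={blocking} warnings={warnings} exit_code={exit_code}")
```

dbt itself already exits with a non-zero code when an error-severity test fails, so a gate like this is mainly useful for custom reporting or for routing warnings to an alerting channel.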
Great Expectations — standalone quality layer
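For non-dbt pipelines, Great Expectations (or a comparable tool such as Soda) runs suites of expectations against a DataFrame or SQL result. The snippet below uses the GX 1.x API and assumes a checkpoint named orders_checkpoint has already been configured:
import great_expectations as gx

# GX 1.x API; older 0.x releases use different method names.
context = gx.get_context()
suite = context.suites.add(gx.ExpectationSuite(name="orders_suite"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="amount_usd", min_value=0))

# Assumes "orders_checkpoint" was configured elsewhere to validate orders_suite.
results = context.checkpoints.get("orders_checkpoint").run()
if not results.success:
    raise RuntimeError("Data quality check failed: aborting pipeline")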
Data contracts
A data contract is a YAML or JSON document owned by the producer team that specifies:
- Schema (column names, types, nullability)
- Semantics (what revenue_usd means: gross or net?)
- Freshness SLA (updated every 15 minutes)
- Breaking-change notification policy
Tools such as Soda Data Contracts, dbt model contracts (contract: enforced: true in a model’s config), or even a plain YAML file in the producer’s repository serve this purpose. CI pipelines can then validate, before merge, that a proposed schema change does not violate the declared contract.
# data-contract.yaml (producer-owned)
dataset: orders
version: "2.1"
columns:
  - name: order_id
    type: string
    nullable: false
  - name: revenue_usd
    type: decimal(18,2)
    nullable: false
    description: "Net revenue after refunds, in USD"
sla:
  freshness_minutes: 15
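A CI check against such a contract might compare the current version with a proposed one and fail the build on breaking changes. The sketch below is one possible approach, not the behavior of any specific tool. The breaking_changes function and its rules (a removed column, a changed type, a column that becomes nullable, a loosened freshness SLA) are assumptions, and the contract is written as a Python dict mirroring the YAML above so that the example needs no YAML parser.

```python
import copy


def breaking_changes(old, new):
    """Return reasons why contract `new` would break consumers of contract `old`."""
    problems = []
    new_cols = {c["name"]: c for c in new["columns"]}
    for col in old["columns"]:
        name = col["name"]
        proposed_col = new_cols.get(name)
        if proposed_col is None:
            problems.append(f"column removed: {name}")
            continue
        if proposed_col["type"] != col["type"]:
            problems.append(
                f"type changed: {name} {col['type']} -> {proposed_col['type']}"
            )
        if not col["nullable"] and proposed_col["nullable"]:
            problems.append(f"column became nullable: {name}")
    if new["sla"]["freshness_minutes"] > old["sla"]["freshness_minutes"]:
        problems.append("freshness SLA loosened")
    return problems


# Dict form of the contract shown above.
current = {
    "dataset": "orders",
    "version": "2.1",
    "columns": [
        {"name": "order_id", "type": "string", "nullable": False},
        {"name": "revenue_usd", "type": "decimal(18,2)", "nullable": False},
    ],
    "sla": {"freshness_minutes": 15},
}

# Hypothetical proposal: weakens revenue_usd and adds a new column.
proposed = copy.deepcopy(current)
proposed["columns"][1].update({"type": "float", "nullable": True})
proposed["columns"].append({"name": "currency", "type": "string", "nullable": False})

problems = breaking_changes(current, proposed)
for problem in problems:
    print("BREAKING:", problem)
```

Adding a column is treated as non-breaking here, so only the two changes to revenue_usd are reported. Whether that rule is appropriate depends on how consumers read the dataset (for example, SELECT * consumers may disagree).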