datarekha
Data Engineering · Medium · Asked at: dbt Labs, Great Expectations, Monte Carlo, Soda, Databricks

What are data quality checks and data contracts, and how do you enforce them in a modern data stack?

The short answer

Data quality checks assert that datasets meet defined expectations — completeness, uniqueness, referential integrity, value ranges — and fail the pipeline or alert when they do not. Data contracts are formal, version-controlled agreements between data producers and consumers specifying schema, semantics, and SLAs, preventing silent breaking changes from propagating downstream.

How to think about it

Bad data that silently reaches dashboards costs more to remediate than bad data caught at ingestion. Quality checks and contracts shift detection upstream, as close to the source as possible.

Data quality dimensions

Dimension | Example check
Completeness | NOT NULL rate for required columns exceeds 99%
Uniqueness | No duplicate order_id values
Freshness | Latest created_at is within 2 hours of now
Referential integrity | Every customer_id in orders exists in customers
Distribution | Revenue per day within 3 standard deviations of 30-day average
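
To make the five dimensions concrete, here is a minimal sketch of each check in plain Python over in-memory rows. The function names, sample rows, and thresholds are illustrative assumptions; in a real stack these checks would normally be expressed as SQL or as tests in a quality tool.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev


def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def duplicates(rows, column):
    """Non-null values of `column` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[column]
        if value is None:
            continue  # nulls are a completeness problem, not a uniqueness one
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def is_fresh(rows, column, max_age, now):
    """True if the newest timestamp in `column` is within `max_age` of `now`."""
    return now - max(r[column] for r in rows) <= max_age


def orphans(child_rows, column, parent_keys):
    """Foreign-key values in the child rows that have no matching parent key."""
    return {r[column] for r in child_rows} - set(parent_keys)


def is_outlier(value, history, n_sigma=3.0):
    """True if `value` lies more than `n_sigma` sample std devs from the mean."""
    return abs(value - mean(history)) > n_sigma * stdev(history)


# Hypothetical sample data, chosen so that several checks fail.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": "o1", "customer_id": "c1", "created_at": now - timedelta(hours=5)},
    {"order_id": "o2", "customer_id": "c2", "created_at": now - timedelta(hours=1)},
    {"order_id": "o2", "customer_id": "c9", "created_at": now - timedelta(minutes=30)},
    {"order_id": None, "customer_id": "c1", "created_at": now - timedelta(minutes=10)},
]
customers = {"c1", "c2"}
daily_revenue = [100.0, 102.0, 98.0, 101.0, 99.0]

report = {
    "completeness": completeness(orders, "order_id"),
    "duplicates": duplicates(orders, "order_id"),
    "fresh": is_fresh(orders, "created_at", timedelta(hours=2), now),
    "orphans": orphans(orders, "customer_id", customers),
    "revenue_outlier": is_outlier(250.0, daily_revenue),
}
```

Each function returns a value rather than raising, so the caller can decide whether a given result should block the pipeline or only raise an alert.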

dbt tests — inline quality checks

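dbt’s built-in generic tests, plus those from the dbt-expectations package, are declared in the model’s YAML file. With dbt build, each model’s tests run right after that model is built, and a failing test causes downstream models to be skipped; dbt test runs the tests as a separate step. Recent dbt versions also accept data_tests: as the key name in place of tests:.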

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: [pending, shipped, delivered, cancelled]
      - name: amount_usd
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000

A failing test with severity: error (the default) makes the dbt run exit with a non-zero code, so it can block deployment; with severity: warn the failure is only logged as a warning.
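
The gating behaviour can be sketched as follows. This is not dbt's implementation; `CheckResult`, `gate`, and the sample results are hypothetical names used only to show how error-level failures block while warn-level failures pass through.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    failures: int   # number of rows that violated the test
    severity: str   # "error" or "warn", mirroring the dbt setting


def gate(results):
    """Split failing tests into blocking errors and non-blocking warnings."""
    errors = [r.name for r in results if r.failures > 0 and r.severity == "error"]
    warnings = [r.name for r in results if r.failures > 0 and r.severity == "warn"]
    return errors, warnings


# Hypothetical outcome of one test run.
results = [
    CheckResult("unique_stg_orders_order_id", failures=0, severity="error"),
    CheckResult("not_null_stg_orders_order_id", failures=3, severity="error"),
    CheckResult("accepted_values_stg_orders_status", failures=12, severity="warn"),
]
errors, warnings = gate(results)

# A CI job would typically stop the deployment on a non-zero exit code.
exit_code = 1 if errors else 0
```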

Great Expectations — standalone quality layer

For non-dbt pipelines, Great Expectations defines expectation suites (Soda offers comparable checks) that run against a DataFrame or the result of a SQL query:

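import great_expectations as gx

# Written against the GX Core 1.x fluent API; older 0.x releases use different calls.
context = gx.get_context()
suite = context.suites.add(gx.ExpectationSuite(name="orders_suite"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="amount_usd", min_value=0))

# Assumes "orders_checkpoint" was already created and bound to this suite and a batch of data.
results = context.checkpoints.get("orders_checkpoint").run()
if not results.success:
    # Raise instead of assert: assert statements are stripped when Python runs with -O.
    raise RuntimeError("Data quality check failed; aborting pipeline")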

Data contracts

A data contract is a YAML or JSON document owned by the producer team that specifies:

  • Schema (column names, types, nullability)
  • Semantics (what revenue_usd means: gross or net?)
  • Freshness SLA (updated every 15 minutes)
  • Breaking-change notification policy

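Tools like Soda Data Contracts, dbt model contracts (enforced: true under the model’s contract config), or even a plain YAML file in the producer’s repository serve this purpose. dbt checks the declared column names and data types when the model is built, and CI pipelines can validate that a proposed schema change does not violate the declared contract before it is merged.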

# data-contract.yaml (producer-owned)
dataset: orders
version: "2.1"
columns:
  - name: order_id
    type: string
    nullable: false
  - name: revenue_usd
    type: decimal(18,2)
    nullable: false
    description: "Net revenue after refunds, in USD"
sla:
  freshness_minutes: 15
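
A CI check against such a contract could look roughly like the sketch below. It assumes the contract has already been loaded into a dict (for example with a YAML parser) and that the proposed schema is available in the same shape; `breaking_changes` and the rules it applies (no removed columns, no type changes, no loosening of nullability) are illustrative assumptions, not the behaviour of any particular tool.

```python
# The contract above, represented as a dict keyed by column name.
CONTRACT = {
    "dataset": "orders",
    "version": "2.1",
    "columns": {
        "order_id": {"type": "string", "nullable": False},
        "revenue_usd": {"type": "decimal(18,2)", "nullable": False},
    },
}


def breaking_changes(contract, proposed_columns):
    """Return human-readable contract violations found in a proposed schema."""
    problems = []
    for name, spec in contract["columns"].items():
        proposed = proposed_columns.get(name)
        if proposed is None:
            problems.append(f"column removed: {name}")
            continue
        if proposed["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} {spec['type']} -> {proposed['type']}"
            )
        if proposed["nullable"] and not spec["nullable"]:
            problems.append(f"became nullable: {name}")
    return problems


# Hypothetical schema from a pull request.
proposed = {
    "order_id": {"type": "string", "nullable": False},
    "revenue_usd": {"type": "float", "nullable": True},
    "coupon_code": {"type": "string", "nullable": True},  # additive, so allowed
}
violations = breaking_changes(CONTRACT, proposed)
```

Here `violations` lists two problems for revenue_usd, while the new coupon_code column passes because additive changes do not break existing consumers. A CI job would fail the merge whenever the list is non-empty.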
