datarekha

Data contracts & quality

Most ML failures are silent data failures. A data contract is an enforced agreement on a dataset's schema and semantics, so bad data fails the pipeline instead of poisoning the model.

6 min read Intermediate MLOps Lesson 4 of 28

What you'll learn

  • Why 'garbage in' is the top cause of silent ML failure
  • What a data contract is — schema + semantics, enforced at runtime
  • Where to enforce it, and the tools that do it

The most common way an ML system fails in production isn’t a bad model — it’s bad data arriving silently. An upstream team renames a column, changes a unit from dollars to cents, or starts sending nulls, and your model keeps serving predictions, now quietly wrong. A data contract is the fix: an explicit, enforced agreement about what a dataset must look like.

Garbage in, wrong out — quietly

The danger of data quality issues is that they’re silent. A crashed pipeline gets noticed; a model that’s 8% less accurate because income is now in cents doesn’t trip any alarm — until the business metrics sag weeks later. The model can’t tell “corrupt data” from “the world changed.” You have to check the data before it reaches the model.
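To make the income example concrete, here is a minimal Python sketch. The column name `income`, the sample values, and the plausible range are illustrative assumptions, not taken from a real dataset. It shows that a type-only check accepts the cents data, while a simple range check rejects it.

```python
# Hypothetical records: the same incomes before and after an upstream
# change from dollars to cents.
dollars = [{"income": 52_000.0}, {"income": 87_500.0}, {"income": 31_200.0}]
cents = [{"income": r["income"] * 100} for r in dollars]


def passes_type_check(rows):
    """Schema-only check: every income is a number."""
    return all(isinstance(r["income"], (int, float)) for r in rows)


def passes_range_check(rows, low=0.0, high=1_000_000.0):
    """Semantic check: every income falls in an assumed plausible range."""
    return all(low <= r["income"] <= high for r in rows)


print(passes_type_check(dollars), passes_range_check(dollars))  # True True
print(passes_type_check(cents), passes_range_check(cents))      # True False
```

The types never changed, so only the semantic check notices that the unit did.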

A contract: schema + semantics, enforced

A data contract is a producer–consumer agreement that goes beyond a schema:

[Diagram: producer (upstream data) → contract gate (schema + semantics) → model pipeline when valid; fail + alert on violation]
A contract gate validates incoming data; valid data proceeds, a violation fails the build and alerts the producer — before the model ever sees it.

A contract specifies: schema (columns, types), constraints (ranges, allowed values, non-null, uniqueness), and freshness/volume expectations. It’s then enforced at runtime — checked every time data flows — so a violation blocks the pipeline at the source instead of silently corrupting training or serving.
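One way to picture such a contract is as a small Python object plus a validation function that raises on any violation. This is a sketch under stated assumptions, not the API of any particular tool: the names `Column`, `Contract`, `ContractViolation`, and `validate`, along with the example columns and thresholds, are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


class ContractViolation(Exception):
    """Raised when a batch breaks the contract, so the pipeline stops."""


@dataclass
class Column:
    dtype: type                        # schema: expected Python type
    nullable: bool = False             # constraint: non-null by default
    min_value: Optional[float] = None  # constraint: range
    max_value: Optional[float] = None
    allowed: Optional[set] = None      # constraint: allowed values
    unique: bool = False               # constraint: uniqueness


@dataclass
class Contract:
    columns: dict                             # column name -> Column
    min_rows: int = 1                         # volume expectation
    max_age: timedelta = timedelta(hours=24)  # freshness expectation


def validate(rows, produced_at, contract, now=None):
    """Return None if the batch is valid; raise ContractViolation otherwise."""
    now = now or datetime.now(timezone.utc)
    errors = []
    if len(rows) < contract.min_rows:
        errors.append(f"volume: {len(rows)} rows < {contract.min_rows}")
    if now - produced_at > contract.max_age:
        errors.append(f"freshness: batch older than {contract.max_age}")
    for name, col in contract.columns.items():
        seen = set()
        for i, row in enumerate(rows):
            if name not in row:
                errors.append(f"row {i}: missing column {name}")
                continue
            value = row[name]
            if value is None:
                if not col.nullable:
                    errors.append(f"row {i}: {name} is null")
                continue
            if not isinstance(value, col.dtype):  # strict: int is not float
                errors.append(f"row {i}: {name} is {type(value).__name__}")
                continue
            if col.min_value is not None and value < col.min_value:
                errors.append(f"row {i}: {name}={value} below {col.min_value}")
            if col.max_value is not None and value > col.max_value:
                errors.append(f"row {i}: {name}={value} above {col.max_value}")
            if col.allowed is not None and value not in col.allowed:
                errors.append(f"row {i}: {name}={value!r} not allowed")
            if col.unique:
                if value in seen:
                    errors.append(f"row {i}: duplicate {name}={value!r}")
                seen.add(value)
    if errors:
        raise ContractViolation("; ".join(errors))


# Hypothetical contract and batches for illustration.
contract = Contract(columns={
    "user_id": Column(int, unique=True),
    "income": Column(float, min_value=0.0, max_value=1_000_000.0),
    "plan": Column(str, allowed={"free", "pro"}),
})
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
good = [{"user_id": 1, "income": 52_000.0, "plan": "pro"}]
bad = [{"user_id": 1, "income": 5_200_000.0, "plan": "pro"}]

validate(good, produced_at=now - timedelta(hours=1), contract=contract, now=now)
try:
    validate(bad, produced_at=now - timedelta(hours=1), contract=contract, now=now)
except ContractViolation as exc:
    print("blocked:", exc)
```

Raising an exception rather than logging a warning is the design choice that matters here: the pipeline step fails, so the bad batch cannot reach training or serving.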

Quick check

Q1. Why are data-quality failures especially dangerous in ML?
Q2. What does a data contract specify beyond column names and types?
Q3. Where should a data contract be enforced?

Next

Data contracts feed clean data to training and serving. The last serving decision: batch vs real-time inference.

Practice this in an interview

What is a data contract, and how does it prevent ML pipelines from breaking silently?

A data contract is an explicit, enforced agreement between a data producer and consumers that specifies schema, types, semantics, and quality or freshness expectations, plus rules for how it can evolve. It prevents silent breakage by validating data at ingestion so violations are caught and quarantined or alerted instead of flowing into the model. Combined with a schema registry and backward-compatible evolution rules, it lets producers change data without unexpectedly corrupting downstream features and predictions.
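The "quarantined or alerted" path can be sketched as a row-level split at ingestion. This is a minimal illustration in Python: the functions `check_row` and `ingest`, the `income` column, and its range are hypothetical, and a real system would write the quarantined rows somewhere durable and notify the producer.

```python
def check_row(row):
    """Return a list of violation messages for one row (empty if valid)."""
    problems = []
    income = row.get("income")
    if income is None:
        problems.append("income is null")
    elif not isinstance(income, (int, float)):
        problems.append("income is not numeric")
    elif not 0 <= income <= 1_000_000:
        problems.append("income outside assumed range")
    return problems


def ingest(rows):
    """Split a batch into rows that pass and rows set aside with reasons."""
    accepted, quarantined = [], []
    for row in rows:
        problems = check_row(row)
        if problems:
            quarantined.append({"row": row, "problems": problems})
        else:
            accepted.append(row)
    return accepted, quarantined


batch = [{"income": 52_000}, {"income": None}, {"income": 8_750_000}]
accepted, quarantined = ingest(batch)
print(len(accepted), "accepted;", len(quarantined), "quarantined")
for item in quarantined:
    print("quarantined:", item["row"], item["problems"])
```

Quarantining keeps the valid rows flowing while preserving the bad ones, with reasons, for the producer to inspect. Failing the whole batch is the stricter alternative.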

What are data quality checks and data contracts, and how do you enforce them in a modern data stack?

Data quality checks assert that datasets meet defined expectations — completeness, uniqueness, referential integrity, value ranges — and fail the pipeline or alert when they do not. Data contracts are formal, version-controlled agreements between data producers and consumers specifying schema, semantics, and SLAs, preventing silent breaking changes from propagating downstream.

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate model quality as well as code correctness. On top of the usual build and test stages, an ML pipeline needs data validation, automated retraining triggers, model evaluation gates, and canary checks on prediction quality, stages that standard software pipelines either lack or apply only to code. A regression in model AUC is as much a deployment failure as a 500 error.

How do you evolve a data schema without breaking downstream ML consumers?

Use a schema registry with backward-compatible evolution rules so changes are managed rather than ad hoc: producers can add optional or nullable fields and consumers ignore unknown fields, which keeps existing pipelines working. Breaking changes such as renaming, removing, or retyping a field require versioning, often a new topic or table, with a migration window and deprecation before the old schema is retired. This lets data evolve continuously while ML features and models stay stable.
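The compatibility rule in this answer can be sketched as a check that compares two schema versions. This is a simplified Python illustration, not how any specific schema registry works: the function `breaking_changes` and the schema representation (field name mapped to a type and a `required` flag) are hypothetical. Treating a newly added required field as breaking is an assumption that goes slightly beyond the text.

```python
def breaking_changes(old, new):
    """List the changes in `new` that would break consumers of `old`."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            # A rename looks the same as a removal from the consumer's side.
            problems.append(f"removed or renamed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"retyped field: {name} ({spec['type']} -> {new[name]['type']})"
            )
    for name, spec in new.items():
        # Assumption: only optional additions are treated as safe.
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


v1 = {
    "user_id": {"type": "int", "required": True},
    "income": {"type": "float", "required": True},
}
# Adds an optional field: existing consumers can ignore it.
v2_ok = {**v1, "region": {"type": "str", "required": False}}
# Renames income and changes its type: needs a new version.
v2_bad = {
    "user_id": {"type": "int", "required": True},
    "income_cents": {"type": "int", "required": True},
}

print(breaking_changes(v1, v2_ok))   # []
print(breaking_changes(v1, v2_bad))
```

A check like this could run in the producer's CI so that a breaking change is flagged before it ships, which is the point at which versioning and a migration window would be planned.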
