How do you evolve a data schema without breaking downstream ML consumers?

Use a schema registry with backward-compatible evolution rules so changes are managed rather than ad hoc: producers can add optional or nullable fields and consumers ignore unknown fields, which keeps existing pipelines working. Breaking changes such as renaming, removing, or retyping a field require versioning, often a new topic or table, with a migration window and deprecation before the old schema is retired. This lets data evolve continuously while ML features and models stay stable.
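As a rough illustration of the rules above, the sketch below compares two schema versions and lists the changes that would break an existing consumer. The `Field` class and `breaking_changes` function are hypothetical names invented for this example, not part of any schema-registry API. Treating a new required field as breaking is an assumption; the text above only names optional or nullable additions as safe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    dtype: str
    required: bool = True

def breaking_changes(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """List the changes in `new` that would break a consumer written against `old`."""
    problems = []
    for name, field in old.items():
        if name not in new:
            # A rename looks like a removal from the consumer's point of view.
            problems.append(f"removed or renamed: {name}")
        elif new[name].dtype != field.dtype:
            problems.append(f"retyped: {name} ({field.dtype} -> {new[name].dtype})")
    for name, field in new.items():
        # Assumption: only optional additions are safe; required ones are flagged.
        if name not in old and field.required:
            problems.append(f"new required field: {name}")
    return problems

v1 = {"user_id": Field("int64"), "income": Field("float64")}
v2_ok = {**v1, "segment": Field("string", required=False)}          # optional addition
v2_bad = {"user_id": Field("int64"), "income_cents": Field("int64")}  # rename

print(breaking_changes(v1, v2_ok))
print(breaking_changes(v1, v2_bad))
```

An empty list means the change can ship in place; anything else calls for a new version with a migration window.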

What is a data contract, and how does it prevent ML pipelines from breaking silently?

A data contract is an explicit, enforced agreement between a data producer and consumers that specifies schema, types, semantics, and quality or freshness expectations, plus rules for how it can evolve. It prevents silent breakage by validating data at ingestion so violations are caught and quarantined or alerted instead of flowing into the model. Combined with a schema registry and backward-compatible evolution rules, it lets producers change data without unexpectedly corrupting downstream features and predictions.
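One way to picture "caught and quarantined or alerted" is a gate that splits each incoming batch into rows that satisfy the contract and rows that are set aside. This is a minimal sketch using plain dictionaries so it runs without dependencies; the rules, the `is_valid` and `split_batch` names, and the sample rows are all assumptions made for illustration.

```python
import math

def is_valid(row: dict) -> bool:
    """Row-level rules: user_id present, income a real number within [0, 1_000_000]."""
    income = row.get("income")
    return (
        row.get("user_id") is not None
        and isinstance(income, (int, float))
        and not math.isnan(income)
        and 0 <= income <= 1_000_000
    )

def split_batch(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into rows that pass the contract and rows to quarantine."""
    valid = [r for r in rows if is_valid(r)]
    quarantined = [r for r in rows if not is_valid(r)]
    return valid, quarantined

batch = [
    {"user_id": 1, "income": 52_000.0},
    {"user_id": 2, "income": 6_100_000.0},  # out of range, possibly a unit error
    {"user_id": 3, "income": None},         # missing value
]
valid, quarantined = split_batch(batch)
if quarantined:
    # In a real pipeline this would page or notify the producer instead of printing.
    print(f"ALERT: quarantined {len(quarantined)} of {len(batch)} rows")
```

Only `valid` would continue to feature computation; the quarantined rows stay available for the producer to inspect.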

What are data quality checks and data contracts, and how do you enforce them in a modern data stack?

Data quality checks assert that datasets meet defined expectations (completeness, uniqueness, referential integrity, value ranges) and fail the pipeline or alert when they do not. Data contracts are formal, version-controlled agreements between data producers and consumers specifying schema, semantics, and SLAs, which prevents silent breaking changes from propagating downstream. To enforce both, run the checks as a gate at ingestion and again before training, and validate proposed contract changes in CI before a producer can ship them.
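The four expectations named above can each be written as a small boolean check. The sketch below is one possible shape, assuming a hypothetical `orders` table that references a set of known user IDs; the column names, the 10,000 upper bound, and the `quality_report` function are invented for the example.

```python
def quality_report(orders: list[dict], known_user_ids: set[int]) -> dict[str, bool]:
    """Evaluate four common expectations on a list of order rows."""
    ids = [o["order_id"] for o in orders]
    amounts = [o["amount"] for o in orders if o.get("amount") is not None]
    return {
        # Completeness: every row has an amount.
        "completeness": len(amounts) == len(orders),
        # Uniqueness: no order_id appears twice.
        "uniqueness": len(ids) == len(set(ids)),
        # Referential integrity: every user_id exists in the users table.
        "referential_integrity": all(o["user_id"] in known_user_ids for o in orders),
        # Value range: amounts fall inside an assumed plausible interval.
        "value_range": all(0 <= a <= 10_000 for a in amounts),
    }

orders = [
    {"order_id": 1, "user_id": 10, "amount": 25.0},
    {"order_id": 2, "user_id": 11, "amount": 40.0},
    {"order_id": 2, "user_id": 99, "amount": 15.0},  # duplicate id, unknown user
]
report = quality_report(orders, known_user_ids={10, 11})
failed = [name for name, passed in report.items() if not passed]
if failed:
    # A pipeline step would typically raise here so the run stops.
    print(f"data quality checks failed: {failed}")
```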

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate not just code correctness but also data and model quality. On top of the usual build and test stages, an ML pipeline adds data validation, automated retraining triggers, model evaluation gates, and canary checks on prediction quality, most of which have no direct equivalent in a standard software pipeline. A regression in model AUC is as much a deployment failure as a 500 error.
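A model evaluation gate can be as small as a comparison between a candidate's metric and the current baseline. The sketch below assumes metrics arrive as plain dictionaries and that a drop of more than 0.01 AUC should block deployment; the function name, the threshold, and the numbers are illustrative choices, not a standard.

```python
def evaluation_gate(candidate: dict[str, float], baseline: dict[str, float],
                    max_auc_drop: float = 0.01) -> tuple[bool, str]:
    """Pass only if the candidate's AUC is within `max_auc_drop` of the baseline's."""
    drop = baseline["auc"] - candidate["auc"]
    if drop > max_auc_drop:
        return False, f"AUC regressed by {drop:.3f} (limit {max_auc_drop:.3f})"
    return True, "ok"

baseline = {"auc": 0.91}   # made-up metric for the model currently in production
candidate = {"auc": 0.87}  # made-up metric for the newly trained model
passed, reason = evaluation_gate(candidate, baseline)
print("deploy" if passed else f"blocked: {reason}")
```

In a CI job, a failed gate would exit non-zero so the deployment stage never runs.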

Data contracts & quality — MLOps

The last lesson left us firefighting: finding corrupted data after it had already trained a worse model, then cleaning up downstream. We asked whether we could instead stop the bad data at the door — write down, in advance, exactly what a dataset is allowed to look like, and make the pipeline reject anything that violates it. This lesson is that idea, made enforceable.

The most common way an ML system fails in production isn’t a bad model — it’s bad data arriving silently. An upstream team renames a column, changes a unit from dollars to cents, or starts sending nulls, and your model keeps serving predictions, now quietly wrong. A data contract is the fix: an explicit, enforced agreement about what a dataset must look like.

Garbage in, wrong out — quietly

The danger of data quality issues is that they’re silent. A crashed pipeline gets noticed; a model that’s 8% less accurate because income is now in cents doesn’t trip any alarm — until the business metrics sag weeks later. The model can’t tell “corrupt data” from “the world changed.” You have to check the data before it reaches the model.

A contract: schema + semantics, enforced

A data contract is a producer–consumer agreement that goes beyond a schema.

A contract gate validates incoming data: valid data proceeds, while a violation fails the build and alerts the producer before the model ever sees it.

A contract specifies: schema (columns, types), constraints (ranges, allowed values, non-null, uniqueness), and freshness/volume expectations. It’s then enforced at runtime — checked every time data flows — so a violation blocks the pipeline at the source instead of silently corrupting training or serving.

import pandas as pd

# A minimal data contract for an incoming feature table.
# Note: dtype is declared here as documentation only. enforce() below checks
# column presence, nulls, ranges, allowed values, and uniqueness, not dtype.
CONTRACT = {
    "user_id": dict(dtype="int64", nullable=False, unique=True),
    "age":     dict(dtype="int64", nullable=False, min=0, max=120),
    "income":  dict(dtype="float64", nullable=False, min=0, max=1_000_000),
    "country": dict(dtype="object", allowed={"US", "UK", "IN", "DE"}),
}

def enforce(df, contract):
    """Return a list of violations; an empty list means the batch passes."""
    errs = []
    for col, rule in contract.items():
        if col not in df.columns:
            errs.append(f"missing column: {col}")
            continue
        s = df[col]
        present = s.dropna()  # range and category rules apply to non-null values only
        if rule.get("nullable") is False and s.isna().any():
            errs.append(f"{col}: nulls")
        if "min" in rule and (present < rule["min"]).any():
            errs.append(f"{col}: below min")
        if "max" in rule and (present > rule["max"]).any():
            errs.append(f"{col}: above max")
        if "allowed" in rule and not set(present) <= rule["allowed"]:
            errs.append(f"{col}: bad category")
        if rule.get("unique") and s.duplicated().any():
            errs.append(f"{col}: duplicates")
    return errs

# Upstream sent income in CENTS by mistake (5.2M, 6.1M); the contract catches it.
batch = pd.DataFrame({
    "user_id": [1, 2],
    "age": [34, 29],
    "income": [5_200_000.0, 6_100_000.0],
    "country": ["US", "UK"],
})
violations = enforce(batch, CONTRACT)
print("contract:", "PASS" if not violations else f"BLOCKED -> {violations}")
if violations:
    print("The model never trains on the corrupted batch.")

contract: BLOCKED -> ['income: above max']
The model never trains on the corrupted batch.

That BLOCKED line is the whole point. The batch is structurally perfect: right columns, right types, no nulls, valid countries. The only thing wrong is that income arrived in cents, so 52,000 dollars shows up as 5,200,000. A schema-only check would not catch it, because the types are fine. The contract’s max=1_000_000 semantic constraint does: it reports income: above max and blocks the batch at the door. The model never trains on it, the silent 8%-accuracy bleed never happens, and the alert lands on the upstream team that broke the unit, not on you three weeks later.
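The example contract covers ranges and allowed values but not the freshness and volume expectations listed earlier. Those could be checked along the following lines. This is a sketch under stated assumptions: the function name, the 24-hour age limit, the row-count bounds, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def check_freshness_and_volume(newest_event: datetime, row_count: int, now: datetime,
                               max_age: timedelta = timedelta(hours=24),
                               min_rows: int = 1_000, max_rows: int = 100_000) -> list[str]:
    """Return violations for a batch that is stale or outside its expected size."""
    errs = []
    # Freshness: the most recent record must be newer than max_age.
    if now - newest_event > max_age:
        errs.append("stale: newest event older than max_age")
    # Volume: a sudden drop or spike in rows often signals an upstream fault.
    if not min_rows <= row_count <= max_rows:
        errs.append(f"volume: {row_count} rows outside [{min_rows}, {max_rows}]")
    return errs

# Illustrative timestamps: the newest record is 30 hours old.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
newest = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
print(check_freshness_and_volume(newest, row_count=120, now=now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.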

In one breath

A data contract is an explicit, runtime-enforced agreement on what a dataset must look like — schema plus semantics (ranges, allowed values, non-null, uniqueness, freshness) — so that when an upstream team silently flips a unit or sends nulls, the bad batch fails the pipeline at the door (BLOCKED -> ['income: above max']) instead of quietly poisoning training or serving and surfacing weeks later as sagging business metrics.

Practice

Before the quiz, reason about why the contract caught the cents bug when a schema check would not. The corrupted batch had the right columns, types, and non-null values — it passed every structural test. Which specific line of the contract caught it, and what category of rule does that belong to? Then think one step upstream: the alert fired on the producer, not the model team — why is that routing the single most valuable thing a contract buys you?

Quick check

Q1. Why are data-quality failures especially dangerous in ML?

Q2. What does a data contract specify beyond column names and types?

Q3. Where should a data contract be enforced?

A question to carry forward

That finishes the Lifecycle chapter. Step back and look at what these four lessons quietly assembled: a loop to work inside, a baseline to clear, a data-centric habit of fixing the data over the model, and now a contract that keeps the incoming data trustworthy. Together they make the inputs and the goalposts honest.

But honest inputs are only half of reproducibility. The contract guarantees what data flowed in — yet when a model misbehaves in production, you also need to answer which model it was: trained on which snapshot, by which code commit, with which hyperparameters, scoring what against the baseline. Run a few dozen experiments and that provenance evaporates within a week unless something records it. So the question to carry into the next chapter is: how do you make every training run reproducible and comparable — so you can always trace a deployed model back to the exact data and code that produced it? That is experiment tracking, and the next lesson builds it with MLflow.
