Data contracts & quality
Most ML failures are silent data failures. A data contract is an enforced agreement on a dataset's schema and semantics, so bad data fails the pipeline instead of poisoning the model.
What you'll learn
- Why 'garbage in' is the top cause of silent ML failure
- What a data contract is — schema + semantics, enforced at runtime
- Where to enforce it, and the tools that do it
The most common way an ML system fails in production isn’t a bad model — it’s bad data arriving silently. An upstream team renames a column, changes a unit from dollars to cents, or starts sending nulls, and your model keeps serving predictions, now quietly wrong. A data contract is the fix: an explicit, enforced agreement about what a dataset must look like.
Garbage in, wrong out — quietly
The danger of data quality issues is that they’re silent. A crashed pipeline gets noticed; a model that’s 8% less accurate because income is now in cents doesn’t trip any alarm until the business metrics sag weeks later. The model can’t tell “corrupt data” from “the world changed,” so you have to check the data before it reaches the model.
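A minimal sketch of such a pre-model check, using the income example. The function name `check_income_scale` and the expected median range are hypothetical, chosen for illustration; real bounds would come from what the producer and consumer agree on.

```python
from statistics import median


def check_income_scale(values, expected_median_range=(20_000, 200_000)):
    """Return the median of `values`, or raise if it falls outside the expected range.

    The default range is an illustrative assumption for annual income in dollars.
    """
    m = median(values)
    low, high = expected_median_range
    if not (low <= m <= high):
        raise ValueError(
            f"income median {m} is outside the expected range [{low}, {high}]"
        )
    return m


dollars = [42_000, 58_500, 73_000, 91_250]
cents = [v * 100 for v in dollars]  # same data after an upstream unit change

check_income_scale(dollars)  # passes: the median is within range

try:
    check_income_scale(cents)
except ValueError as err:
    print(f"blocked before the model: {err}")
```

A check on a summary statistic like this catches a unit change that per-row type checks would miss, since the values in cents are still valid numbers.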
A contract: schema + semantics, enforced
A data contract is a producer–consumer agreement that goes beyond a schema. It specifies three things:
- Schema: columns and types
- Constraints: ranges, allowed values, non-null, uniqueness
- Freshness and volume expectations
The contract is then enforced at runtime, checked every time data flows, so a violation blocks the pipeline at the source instead of silently corrupting training or serving.
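The pieces of a contract can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the API of any particular tool: `Column`, `Contract`, `ContractViolation`, and `validate` are hypothetical names, and a batch is assumed to be a list of dicts with a timezone-aware production timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional


class ContractViolation(Exception):
    """Raised when a batch breaks the contract, so the pipeline stops here."""


@dataclass
class Column:
    name: str
    dtype: type
    nullable: bool = False
    unique: bool = False
    check: Optional[Callable[[Any], bool]] = None  # range or allowed-value rule


@dataclass
class Contract:
    columns: list          # list of Column
    min_rows: int          # volume expectation
    max_age: timedelta     # freshness expectation


def validate(rows, produced_at, contract, now=None):
    """Check one batch against the contract; raise ContractViolation on any failure."""
    now = now or datetime.now(timezone.utc)  # produced_at must be timezone-aware too
    errors = []
    if len(rows) < contract.min_rows:
        errors.append(f"volume: {len(rows)} rows, expected at least {contract.min_rows}")
    if now - produced_at > contract.max_age:
        errors.append(f"freshness: batch is older than {contract.max_age}")
    for col in contract.columns:
        seen = set()
        for i, row in enumerate(rows):
            if col.name not in row:
                errors.append(f"row {i}: missing column {col.name!r}")
                continue
            value = row[col.name]
            if value is None:
                if not col.nullable:
                    errors.append(f"row {i}: {col.name!r} is null")
                continue
            if not isinstance(value, col.dtype):
                errors.append(f"row {i}: {col.name!r} is not {col.dtype.__name__}")
                continue
            if col.check is not None and not col.check(value):
                errors.append(f"row {i}: {col.name!r}={value!r} violates its constraint")
            if col.unique:
                if value in seen:
                    errors.append(f"row {i}: duplicate {col.name!r}={value!r}")
                seen.add(value)
    if errors:
        raise ContractViolation("; ".join(errors))


contract = Contract(
    columns=[
        Column("user_id", int, unique=True),
        Column("income", float, check=lambda v: 0 <= v <= 1_000_000),
        Column("country", str, check=lambda v: v in {"US", "CA", "GB"}),
    ],
    min_rows=2,
    max_age=timedelta(hours=24),
)

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
good = [
    {"user_id": 1, "income": 52_000.0, "country": "US"},
    {"user_id": 2, "income": 61_500.0, "country": "CA"},
]
# Passes silently; a violating batch would raise and stop the pipeline.
validate(good, produced_at=now - timedelta(hours=1), contract=contract, now=now)
```

Raising an exception rather than logging a warning is the design choice that matters here: the bad batch fails loudly at the boundary instead of reaching training or serving.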
Next
Data contracts feed clean data to training and serving. The last serving decision: batch vs real-time inference.
Practice this in an interview
A data contract is an explicit, enforced agreement between a data producer and consumers that specifies schema, types, semantics, and quality or freshness expectations, plus rules for how it can evolve. It prevents silent breakage by validating data at ingestion so violations are caught and quarantined or alerted instead of flowing into the model. Combined with a schema registry and backward-compatible evolution rules, it lets producers change data without unexpectedly corrupting downstream features and predictions.
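One way to picture "caught and quarantined" at ingestion is the sketch below. The function `ingest`, the row-level predicate `is_valid`, and the 5% rejection threshold are all assumptions made for illustration.

```python
def ingest(rows, is_valid, max_bad_fraction=0.05):
    """Split a batch into accepted and quarantined rows.

    If more than `max_bad_fraction` of the rows fail validation, reject the
    whole batch, since widespread failures suggest an upstream change.
    """
    accepted, quarantined = [], []
    for row in rows:
        (accepted if is_valid(row) else quarantined).append(row)
    if rows and len(quarantined) / len(rows) > max_bad_fraction:
        raise RuntimeError(
            f"{len(quarantined)} of {len(rows)} rows failed validation; batch rejected"
        )
    return accepted, quarantined


def has_valid_income(row):
    """Hypothetical row-level rule: income is present and non-negative."""
    income = row.get("income")
    return income is not None and income >= 0


batch = [{"income": 50_000.0}] * 98 + [{"income": None}, {"income": -1.0}]
accepted, quarantined = ingest(batch, has_valid_income)
print(len(accepted), len(quarantined))
```

Quarantined rows would typically be written somewhere the producer can inspect them, while the accepted rows continue downstream.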
Data quality checks assert that datasets meet defined expectations — completeness, uniqueness, referential integrity, value ranges — and fail the pipeline or alert when they do not. Data contracts are formal, version-controlled agreements between data producers and consumers specifying schema, semantics, and SLAs, preventing silent breaking changes from propagating downstream.
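Two of the expectations named above, completeness and referential integrity, can be sketched with plain Python over lists of dicts. The helper names and the `users`/`orders` tables are hypothetical.

```python
def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)


def orphaned_keys(child_rows, fk, parent_rows, pk):
    """Foreign-key values in the child table that have no matching parent row."""
    parent_keys = {r[pk] for r in parent_rows}
    return sorted({r[fk] for r in child_rows if r[fk] not in parent_keys})


users = [{"id": 1}, {"id": 2}]
orders = [
    {"user_id": 1, "amount": 19.99},
    {"user_id": 2, "amount": None},   # incomplete row
    {"user_id": 3, "amount": 5.00},   # refers to a user that does not exist
]

amount_completeness = completeness(orders, "amount")
orphans = orphaned_keys(orders, "user_id", users, "id")

# A pipeline step would fail or alert when either expectation is not met.
if amount_completeness < 0.99 or orphans:
    print(f"quality check failed: completeness={amount_completeness:.2f}, orphans={orphans}")
```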
ML CI/CD must validate not just code correctness but also model quality — automated retraining triggers, data validation, model evaluation gates, and canary deployment checks that standard software pipelines have no equivalent for. A regression in model AUC is as much a deployment failure as a 500 error.
Use a schema registry with backward-compatible evolution rules so changes are managed rather than ad hoc: producers can add optional or nullable fields and consumers ignore unknown fields, which keeps existing pipelines working. Breaking changes such as renaming, removing, or retyping a field require versioning, often a new topic or table, with a migration window and deprecation before the old schema is retired. This lets data evolve continuously while ML features and models stay stable.
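The evolution rules above can be sketched as a compatibility check between two schema versions. This is an illustration and not the behavior of any specific schema registry: schemas are assumed to be dicts mapping a field name to a `(type_name, nullable)` pair, and the function name `breaking_changes` is hypothetical. Treating a newly added non-nullable field as breaking is a conservative assumption of this sketch.

```python
def breaking_changes(old, new):
    """List the changes from `old` to `new` that would require a new version.

    Each schema maps field name -> (type_name, nullable).
    A rename shows up as a removal plus an addition.
    """
    problems = []
    for name, (dtype, _nullable) in old.items():
        if name not in new:
            problems.append(f"removed or renamed field: {name}")
        elif new[name][0] != dtype:
            problems.append(f"retyped field: {name} ({dtype} -> {new[name][0]})")
    for name, (_dtype, nullable) in new.items():
        if name not in old and not nullable:
            problems.append(f"added required field: {name}")
    return problems


v1 = {"user_id": ("int", False), "income": ("float", False)}

# Adding a nullable field is allowed: existing consumers ignore it.
v2 = {**v1, "segment": ("str", True)}
print(breaking_changes(v1, v2))

# Renaming `income` is breaking and would need a new topic or table.
v3 = {"user_id": ("int", False), "income_usd": ("float", False)}
print(breaking_changes(v1, v3))
```

A check like this could run in the producer's CI so that a breaking change is flagged before it ships, which is the point at which versioning and a migration window can still be planned.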