What is a data contract, and how does it prevent ML pipelines from breaking silently?
A data contract is an explicit, enforced agreement between a data producer and its consumers that specifies schema, types, semantics, and quality or freshness expectations, plus rules for how the data can evolve. It prevents silent breakage by validating data at ingestion, so violating records are quarantined and an alert is raised instead of the bad data flowing into the model. Combined with a schema registry and backward-compatible evolution rules, it lets producers change data without unexpectedly corrupting downstream features and predictions.
How to think about it
The short answer
A data contract is an explicit, enforced agreement between a data producer and its consumers specifying schema, types, semantics, quality/freshness SLAs, and how the data is allowed to evolve. It prevents silent breakage by validating at ingestion: violating records are quarantined and an alert is raised, instead of the bad data quietly flowing into the model.
Why ML pipelines break silently without one
The most dangerous ML failures are silent: an upstream team renames a column, changes units (dollars → cents), or starts sending nulls. The pipeline doesn’t crash — it trains or serves on subtly wrong data, and you only notice when metrics sag days later. Strong engineers treat data quality as a contract problem: define what can change, what can’t, how changes are announced, and what happens on violation.
What a contract enforces
- Schema & types: required fields, types, nullability.
- Semantics & units: what a column means, not just its type.
- Quality/freshness SLAs: volume, distribution bounds, max staleness.
- Evolution rules: producers may add nullable fields; breaking changes require a new version and a migration window before deprecation.
Mechanically, this is often a schema registry (with backward-compatible evolution) plus validation at ingestion that routes unknown or invalid records to quarantine and alerts whoever is on call.
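The validate-and-quarantine step can be sketched in a few lines of plain Python. This is a minimal illustration, not a specific library's API: `FieldSpec`, `CONTRACT`, `violations`, and `ingest` are hypothetical names, the payments-style fields are made up, and the `alert` callback stands in for whatever paging or messaging hook a team actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    type_: type
    nullable: bool = False
    unit: Optional[str] = None  # documents semantics; not checkable per record


# Hypothetical contract for a payments feed (illustrative field names).
CONTRACT: Dict[str, FieldSpec] = {
    "user_id": FieldSpec(str),
    "amount": FieldSpec(float, unit="USD"),
    "device_type": FieldSpec(str, nullable=True),
}


def violations(record: Dict[str, Any], contract: Dict[str, FieldSpec]) -> List[str]:
    """Return human-readable contract violations; an empty list means valid."""
    problems = []
    for name, spec in contract.items():
        value = record.get(name)
        if value is None:
            # Missing and null are treated the same way in this sketch.
            if not spec.nullable:
                problems.append(f"{name}: missing or null")
            continue
        # Strict type check: an int amount would be rejected here.
        if not isinstance(value, spec.type_):
            problems.append(
                f"{name}: expected {spec.type_.__name__}, got {type(value).__name__}"
            )
    # Fields not named in the contract are ignored, so additive changes pass.
    return problems


def ingest(
    records: List[Dict[str, Any]],
    contract: Dict[str, FieldSpec],
    alert: Callable[[str], None] = print,
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]]]:
    """Split records into accepted and quarantined, alerting once per batch."""
    accepted, quarantined = [], []
    for record in records:
        problems = violations(record, contract)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    if quarantined:
        alert(f"{len(quarantined)} record(s) quarantined for contract violations")
    return accepted, quarantined
```

Keeping the reasons next to each quarantined record matters in practice: the producer can see exactly which rule was broken, and the records can be replayed once the upstream issue is fixed.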
Concrete example
A producer adds a nullable device_type field. This is allowed, because consumers ignore unknown fields. Later the producer tries to change amount from dollars to cents. That is a breaking change, so the contract rejects it and forces a new schema version and a migration window. Without the contract, every prediction would have silently used amounts inflated 100x.
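The two changes in this example can be told apart mechanically. The sketch below assumes each schema version records a unit as field metadata next to the type; that is an assumption of this illustration, since a type-only compatibility check would not notice dollars becoming cents (the type stays the same). `Field` and `breaking_changes` are hypothetical names, and the rule set is deliberately small.

```python
from typing import Dict, List, NamedTuple, Optional


class Field(NamedTuple):
    type_name: str
    nullable: bool
    unit: Optional[str] = None  # assumed metadata; many schemas do not carry it


def breaking_changes(old: Dict[str, Field], new: Dict[str, Field]) -> List[str]:
    """List the changes between two schema versions that would break consumers."""
    problems = []
    for name, old_field in old.items():
        if name not in new:
            problems.append(f"{name}: removed")
            continue
        new_field = new[name]
        if new_field.type_name != old_field.type_name:
            problems.append(
                f"{name}: type {old_field.type_name} -> {new_field.type_name}"
            )
        if new_field.unit != old_field.unit:
            problems.append(f"{name}: unit {old_field.unit} -> {new_field.unit}")
        if new_field.nullable and not old_field.nullable:
            problems.append(f"{name}: became nullable")
    # Added fields are fine only if consumers can safely see them absent.
    for name in sorted(new.keys() - old.keys()):
        if not new[name].nullable:
            problems.append(f"{name}: new field must be nullable")
    return problems


v1 = {"amount": Field("float", nullable=False, unit="USD")}
v2 = {**v1, "device_type": Field("string", nullable=True)}
v3 = {**v2, "amount": Field("float", nullable=False, unit="US cents")}

print(breaking_changes(v1, v2))  # additive change: no problems reported
print(breaking_changes(v2, v3))  # unit change on amount is reported
```

A check like this would typically run when the producer proposes a new schema version, so the unit change is blocked before any data in the new format is emitted.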
Common follow-up / trap
Interviewers ask: “What happens when a contract is violated?” Good answers specify the action: quarantine bad records, alert the owner, and fail the pipeline rather than ingest garbage. The trap is describing a contract as just “a schema doc” — the whole value is that it’s enforced at runtime with defined consequences, and that it draws a clear line between non-breaking and breaking changes.
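One way to make "defined consequences" concrete is a batch-level gate that runs after record validation and fails the run when the batch as a whole breaks the contract. The sketch below is an assumption-laden illustration: `BatchSla`, `ContractViolation`, `enforce_batch`, and the threshold values are all hypothetical, and real limits would come from the agreed SLAs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


class ContractViolation(Exception):
    """Raised to stop the pipeline instead of training or serving on bad data."""


@dataclass(frozen=True)
class BatchSla:
    min_rows: int                 # volume expectation
    max_quarantine_ratio: float   # tolerated share of invalid records
    max_staleness: timedelta      # freshness expectation


def enforce_batch(
    accepted_rows: int,
    quarantined_rows: int,
    newest_event_time: datetime,
    sla: BatchSla,
    now: Optional[datetime] = None,
) -> None:
    """Raise ContractViolation if the batch breaks any SLA; otherwise return."""
    now = now or datetime.now(timezone.utc)
    total = accepted_rows + quarantined_rows
    problems = []
    if accepted_rows < sla.min_rows:
        problems.append(f"volume {accepted_rows} below minimum {sla.min_rows}")
    if total and quarantined_rows / total > sla.max_quarantine_ratio:
        problems.append(
            f"quarantine ratio {quarantined_rows / total:.2%} above limit"
        )
    if now - newest_event_time > sla.max_staleness:
        problems.append(f"data stale by {now - newest_event_time}")
    if problems:
        raise ContractViolation("; ".join(problems))
```

A few bad records are quarantined and the run continues; a batch that is too small, too dirty, or too old raises an exception, so the orchestrator marks the run as failed and the owner is alerted rather than the model consuming the data.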