datarekha
Data Engineering · Medium · Asked at Confluent, Databricks, Snowflake, Airflow

How do you handle schema evolution in data pipelines without breaking downstream consumers?

The short answer

Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include adding only nullable columns or columns with defaults (backward-compatible), using a schema registry to enforce compatibility rules before a producer publishes, and adopting open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.

How to think about it

Unmanaged schema changes are one of the most common causes of silent pipeline breakage. A backend engineer adds a column, renames a field, or changes a type, and downstream consumers that were not notified either fail or keep running and start producing nulls.

Compatibility modes

Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) enforce compatibility before a producer can publish a new schema version:

Mode      Allows
BACKWARD  New schema can read data written by old schema
FORWARD   Old schema can read data written by new schema
FULL      Both directions; strictest, safest for multi-team pipelines
NONE      No check; dangerous in production

BACKWARD-compatible changes (safe):

  • Adding a column with a default value or null
  • Adding an optional Avro/Protobuf field

Breaking changes:

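  • Removing a required column (one with no default): breaks FORWARD and FULL, because consumers still on the old schema expect the field; Avro's BACKWARD rules on their own permit deleting fields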
  • Renaming a column
  • Changing a column type (e.g., INT to STRING)
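The BACKWARD rules above can be sketched as a small checker. This is a simplified illustration, not how any registry implements it: the `Field` class and `backward_violations` function are hypothetical names, schemas are modelled as plain dicts, and every type change is treated as breaking (real Avro allows some promotions, such as int to long). Removed fields pass this check because a reader ignores data it does not declare; catching those needs the FORWARD direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified field description."""
    type: str
    has_default: bool = False


def backward_violations(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """Return reasons the new schema cannot read data written with the old one."""
    problems = []
    for name, field in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a default to fill in.
            if not field.has_default:
                problems.append(f"added field '{name}' has no default")
        elif old[name].type != field.type:
            problems.append(
                f"field '{name}' changed type {old[name].type} -> {field.type}"
            )
    return problems


old = {"order_id": Field("long"), "amt": Field("double")}
added = {**old, "coupon": Field("string", has_default=True)}
renamed = {"order_id": Field("long"), "amount_usd": Field("double")}
retyped = {"order_id": Field("string"), "amt": Field("double")}

print(backward_violations(old, added))    # no violations
print(backward_violations(old, renamed))  # the rename looks like a new required field
print(backward_violations(old, retyped))  # the type change is flagged
```

Note that a rename shows up as "old field dropped, new field added without a default", which is why it counts as breaking unless the format offers aliases or ID-based resolution.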

Handling evolution in practice

Avro / Protobuf with schema registry:

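# Producer registers the new schema version before its first publish.
# new_avro_schema is assumed to be a confluent_kafka.schema_registry.Schema object.
schema_registry_client.register_schema("orders-value", new_avro_schema)
# The registry checks the subject's configured compatibility mode (e.g. BACKWARD)
# and rejects the registration if the new version violates it.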

Parquet / Delta Lake — column addition:

# Delta Lake: adding a column is safe with mergeSchema
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://my-bucket/orders/")
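To make the effect of column addition concrete, here is a plain-Python sketch of the idea rather than Delta Lake's implementation: the table schema becomes the union of the existing and incoming columns, and rows written before the change read back with null in the new column. The `merge_schema` and `read_row` helpers are hypothetical, and rejecting every type mismatch is a simplifying assumption.

```python
def merge_schema(table: dict[str, str], batch: dict[str, str]) -> dict[str, str]:
    """Union of table and batch columns; existing columns must keep their type."""
    merged = dict(table)
    for name, dtype in batch.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(f"column '{name}': {merged[name]} vs {dtype}")
        merged.setdefault(name, dtype)
    return merged


def read_row(row: dict, schema: dict[str, str]) -> dict:
    """Project a stored row onto the current schema; missing columns become None."""
    return {name: row.get(name) for name in schema}


table_schema = {"order_id": "long", "amt": "double"}
batch_schema = {"order_id": "long", "amt": "double", "coupon": "string"}

merged = merge_schema(table_schema, batch_schema)
old_row = read_row({"order_id": 1, "amt": 9.5}, merged)
print(merged)
print(old_row)
```

Consumers that select columns by name keep working after the append; consumers that rely on column position or on `SELECT *` with a fixed width are the ones to watch.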

Iceberg — full schema evolution without rewriting data:

Iceberg tracks schema changes by column ID (not name), so renaming a column does not require rewriting Parquet files — the metadata layer maps the new name to the existing column ID.

-- Iceberg: rename column without touching data files
ALTER TABLE orders RENAME COLUMN amt TO amount_usd;
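The ID-based resolution can be illustrated with a toy model in Python. This is only a sketch of the idea, not Iceberg's file or metadata layout: the rows, the ID numbers, and the `read` helper are invented for illustration. Data is stored against column IDs, the schema maps names to IDs, and a rename swaps a name in the schema while the stored rows stay untouched.

```python
# Toy "data file": values are keyed by column ID, never by name.
data_file_rows = [{1: 1001, 2: 25.0}, {1: 1002, 2: 40.0}]

# Schema metadata maps column names to IDs.
schema_v1 = {"order_id": 1, "amt": 2}
schema_v2 = {"order_id": 1, "amount_usd": 2}  # after the rename; IDs unchanged


def read(rows: list[dict], schema: dict[str, int]) -> list[dict]:
    """Resolve each named column through its ID when reading stored rows."""
    return [
        {name: row.get(col_id) for name, col_id in schema.items()}
        for row in rows
    ]


before = read(data_file_rows, schema_v1)
after = read(data_file_rows, schema_v2)
print(before)
print(after)
```

The same rows come back under the new name, which is why the rename is a metadata-only operation. Queries that still reference the old name will fail, so downstream SQL needs updating even though no data moved.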

dbt contract enforcement

dbt 1.5+ supports model contracts, in which a model explicitly declares its column names and data types. With the contract enforced, a model whose output changes a column type or drops a declared column fails the build, so CI catches it before it reaches production.

models:
  - name: orders_daily
    config:
      contract:
        enforced: true
    columns:
      - name: order_date
        data_type: date
      - name: revenue_usd
        data_type: numeric
