What are the differences between a data warehouse, a data lake, and a data lakehouse?
A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.
How to think about it
Understanding where each paradigm breaks down explains why the lakehouse pattern emerged — it was not hype but a practical response to real limitations.
Data warehouse
Purpose-built analytical databases (Snowflake, BigQuery, Redshift) with columnar storage, query optimization, and tight governance. Schema is defined at write time; raw or semi-structured data requires preprocessing before it can land.
Limitations: expensive per-byte storage; raw JSON, images, and ML model artifacts do not fit naturally; vendor lock-in on proprietary formats.
Data lake
Object storage (S3, GCS, ADLS) holding raw files — CSV, Parquet, JSON, Avro. Cheap at scale, supports any format, accessible by any compute engine.
Limitations: no ACID transactions (two writers corrupt a table); no built-in schema enforcement; query engines must read full files to find a few rows; no versioning or time-travel out of the box. The “data swamp” failure mode is real.
Lakehouse
Open table formats written on top of object storage solve the lake’s limitations:
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| ACID transactions | Yes | Yes | Yes |
| Time travel | Yes | Yes | Yes |
| Schema evolution | Yes | Yes | Yes |
| Hidden partitioning | No | Yes | No |
| Streaming + batch | Yes | Yes | Yes (upsert-focused) |
# Write a Delta table with ACID guarantees from Spark
df.write.format("delta").mode("overwrite").save("s3://my-bucket/orders/")
# Time travel: read yesterday's snapshot
spark.read.format("delta").option("versionAsOf", 42).load("s3://my-bucket/orders/")
The query engine (Spark, Trino, Databricks SQL, Athena) reads the table format’s transaction log to find exactly which Parquet files to read, skipping irrelevant files via partition pruning and column statistics.