Warehouse, Lake & Lakehouse
Three ways to store analytical data — and why the lakehouse is just a data lake plus a transaction layer. Schema-on-write vs read, the data swamp, open table formats, and how one copy of data finally serves both BI and ML.
What you'll learn
- Schema-on-write vs schema-on-read — the one idea that splits warehouses from lakes
- Why cheap object storage created the "data swamp" problem
- How an open table format (Delta, Iceberg, Hudi) adds ACID and time travel to plain files
- Why a lakehouse lets BI and ML run on ONE copy of the data
- The current 2025-26 state of Iceberg vs Delta and where catalogs fit
Before you start
Every analytics team eventually hits the same fork in the road. You have data — orders, events, logs, maybe images — and you need somewhere to put it so people can ask questions of it. For thirty years there were only two real answers, and they pulled in opposite directions. One was clean, governed, and expensive. The other was cheap, flexible, and a mess. The lakehouse is the recent attempt to stop choosing.
The dividing line: schema-on-write vs schema-on-read
Before any product names, there is one idea that explains the whole split: when does your data have to obey a schema?
A data warehouse uses schema-on-write. Before a single row lands, you clean it, type it, and conform it to a predefined table shape — that is what the “T” in ETL does. The upside is that everything in the warehouse is already trustworthy, governed, and ACID-safe, so SQL and BI tools just work. The downside is rigidity and cost: you model up front, and historically the storage and compute were welded together.
A data lake uses schema-on-read. You dump files in as they are — Parquet, JSON, CSV, logs, images — and only apply a schema later, at the moment some query reads them. Ingestion is dirt cheap and accepts anything. But the lake itself enforces nothing: no guaranteed schema, no governance, and crucially no ACID.
A quick myth to kill now: schema-on-read is not “no schema.” A schema still gets applied — it just happens at read time, decided by the consumer, instead of being enforced at write time.
The warehouse: clean, governed, historically expensive
Think Teradata, Oracle, the classic on-premises appliance. Structured tables, modeled in advance, queried with SQL, locked down with governance. The pain was the architecture: storage and compute were tightly coupled. Need more query horsepower? You also paid for more storage, and vice versa. That made warehouses powerful but expensive and inflexible.
The cloud warehouses re-architected exactly that coupling. Snowflake splits into three independent layers — centralized storage, elastic compute (“virtual warehouses”), and cloud services — so compute scales on its own. Google BigQuery is serverless and separates its Dremel execution engine from its Colossus storage. Amazon Redshift RA3 nodes use “Redshift Managed Storage” to push cold data to S3 while caching the hot part locally. Same idea everywhere: stop welding the two together.
The lake: cheap, flexible, and prone to becoming a swamp
A data lake stores raw files of any type on cheap object storage — Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). The earliest lakes ran on Hadoop’s HDFS. The term itself was coined around 2010-2011 by James Dixon of Pentaho, contrasting a lake with a tidy little “data mart.”
The promise is real: store anything, store it cheaply, decide what it means later. But with no governance and no enforced structure, a lake left to rot becomes the thing every data engineer dreads — a data swamp: millions of undocumented files nobody can make sense of, with no transactions, no reliable schema, and no easy path to clean BI. The swamp is the canonical failure mode of an ungoverned lake, and it is precisely the gap the lakehouse was built to close.
File format vs table format — the distinction everything hinges on
Here is the single most-confused point in this whole topic, so go slow.
Parquet and ORC are FILE formats. A file format defines how the bytes of one file are laid out — Parquet and ORC store data columnar; Avro stores it row-by-row. That is all a file format does. It has no concept of a transaction across many files, no updates or deletes, no snapshots, no schema history. It is, quite literally, just files in a folder.
Delta Lake, Apache Iceberg, and Apache Hudi are TABLE formats. A table format is a thin metadata and transaction-log layer that sits on top of the data files (usually Parquet) and lets an engine treat a whole directory of files as one transactional table. That layer is what buys you:
- ACID transactions — concurrent writes that don’t corrupt readers.
- Time travel — query the table as it looked at an earlier snapshot.
- Schema enforcement and evolution — reject bad writes; add columns safely.
- Performance — skip irrelevant files using indexed statistics, instead of listing every file in the folder.
The mechanics differ. Delta Lake records each commit as a JSON entry in
a _delta_log directory (periodically compacted into Parquet
checkpoints), where each entry lists the files added or removed and the
schema at that point. Iceberg keeps a tree of metadata — a metadata file
pointing to snapshots, pointing to manifest lists, pointing to manifest
files that index the data — which lets a query planner prune files at
petabyte scale instead of doing a slow Hive-style “list the directory,
then filter” scan.
The origins are worth knowing because they explain the design goals: Iceberg came out of Netflix (Ryan Blue and Daniel Weeks, to fix Hive metastore limits), Hudi out of Uber (the name is “Hadoop Upserts Deletes and Incrementals,” built for near-real-time upserts), and Delta Lake out of Databricks (transactional reliability for Spark on object storage). All three are open and donated to foundations.
The lakehouse: lake storage + a table format = warehouse guarantees
Now the payoff. A lakehouse keeps the lake’s cheap object storage and adds the table-format layer on top — and that combination gives you warehouse-grade behavior (ACID, governance, BI performance) directly on the files. The headline consequence: BI and ML run on ONE copy of the data instead of copying it out of a lake into a separate warehouse.
The term was popularized by Databricks in the 2021 CIDR paper “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics” (Armbrust, Ghodsi, Xin, Zaharia). Its argument: build on open formats like Parquet, support ML and data science as first-class citizens, and still hit warehouse-class performance — addressing warehouse pain points like staleness, reliability, total cost of ownership, and lock-in.
Use the explorer to see the three architectures morph. Watch what the Lakehouse tab does that the Lake tab can’t: it slots a table-format layer between the same cheap storage and the consumers, and the property checklist flips from red to green.
The aha the widget is built to deliver: a lakehouse does not need a special new storage system. It uses the same S3/ADLS/GCS as a plain lake — the warehouse-like guarantees come from the table format’s metadata, not from new hardware.
Organizing the lake: the medallion idea, briefly
Once you have a transactional lakehouse, a common convention for arranging tables is the medallion (a.k.a. “multi-hop”) architecture, promoted by Databricks. Data flows through three logical layers:
- Bronze — raw landed data, source structure as-is, plus load metadata.
- Silver — cleaned, conformed, deduplicated; the enterprise view.
- Gold — business-level aggregates and features, ready for BI and ML.
Treat medallion as a naming convention for layers, not a product or a hard rule. It is just a tidy way to refine data step by step inside the lakehouse.
Where this stands in 2025-26
The landscape has converged on two open table formats — Apache Iceberg and Delta Lake — and the interesting action has moved up to the catalog layer. A few verified proof points:
- AWS launched Amazon S3 Tables (fully managed Iceberg) at re:Invent in December 2024 — the first cloud object store with native managed Iceberg.
- Snowflake made Iceberg tables GA and open-sourced its catalog as Apache Polaris; BigQuery added managed Iceberg via BigLake Metastore.
- Databricks acquired Tabular (Iceberg’s original creators) for over a billion dollars in June 2024 and now natively supports both Delta and Iceberg, pushing Delta Lake UniForm so a single Parquet table can be read as Delta or Iceberg.
Engines increasingly read these formats directly, in place — Spark, Flink, Trino/Presto, Dremio, DuckDB, Athena, plus Snowflake and BigQuery reading Iceberg — so you rarely need to copy data into a warehouse just to query it.
Two honest hedges. First, don’t crown a winner: as of 2025-26 both formats are widely used. Reported surveys suggest Iceberg leads new adoption and vendor-neutrality while Delta keeps the larger installed base (Databricks plus Microsoft Fabric, which defaults to Delta) — and the convergence story (UniForm, shared Iceberg REST catalogs) is Databricks’ stated direction, not a finished fact. Second, the catalog battle — Apache Polaris (Iceberg-focused, ASF-governed) vs Databricks Unity Catalog (open-sourced, multi-format, speaks the Iceberg REST API) — is still live, and cross-engine reads, while real, still have rough edges.
Quick check
Quick check
Next
You now have the storage map: warehouse (governed, schema-on-write), lake (cheap, schema-on-read), and lakehouse (the lake plus a table format that buys back the guarantees). Next we look inside how those engines actually read columnar Parquet fast — and why predicate pushdown makes a filter free.
Practice this in an interview
All questionsA data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.
Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.
OLTP systems handle many small, latency-sensitive transactions that read and write a few rows at a time, so they are optimized for fast point lookups and row-level locking. OLAP systems run infrequent but wide analytical queries over millions of rows, so they benefit from columnar storage, bulk scans, and denormalized schemas that minimize joins.
Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.