datarekha

Warehouse, Lake & Lakehouse

Three ways to store analytical data — and why the lakehouse is just a data lake plus a transaction layer. Schema-on-write vs read, the data swamp, open table formats, and how one copy of data finally serves both BI and ML.

9 min read Beginner SQL Lesson 22 of 27

What you'll learn

  • Schema-on-write vs schema-on-read — the one idea that splits warehouses from lakes
  • Why cheap object storage created the "data swamp" problem
  • How an open table format (Delta, Iceberg, Hudi) adds ACID and time travel to plain files
  • Why a lakehouse lets BI and ML run on ONE copy of the data
  • The current 2025-26 state of Iceberg vs Delta and where catalogs fit

Before you start

Every analytics team eventually hits the same fork in the road. You have data — orders, events, logs, maybe images — and you need somewhere to put it so people can ask questions of it. For thirty years there were only two real answers, and they pulled in opposite directions. One was clean, governed, and expensive. The other was cheap, flexible, and a mess. The lakehouse is the recent attempt to stop choosing.

The dividing line: schema-on-write vs schema-on-read

Before any product names, there is one idea that explains the whole split: when does your data have to obey a schema?

A data warehouse uses schema-on-write. Before a single row lands, you clean it, type it, and conform it to a predefined table shape — that is what the “T” in ETL does. The upside is that everything in the warehouse is already trustworthy, governed, and ACID-safe, so SQL and BI tools just work. The downside is rigidity and cost: you model up front, and historically the storage and compute were welded together.

A data lake uses schema-on-read. You dump files in as they areParquet, JSON, CSV, logs, images — and only apply a schema later, at the moment some query reads them. Ingestion is dirt cheap and accepts anything. But the lake itself enforces nothing: no guaranteed schema, no governance, and crucially no ACID.

Schema-on-WRITEthe warehouse way1. clean + type the data2. conform to a fixed schema3. THEN write the rows✓ trusted, governed, ACID✗ rigid, costly up frontSchema-on-READthe lake way1. land files exactly as-is2. no schema enforced yet3. apply schema at query time✓ cheap, takes any data✗ no governance, no ACID
Schema-on-read does NOT mean “no schema” — it means the schema is applied later, by the reader.

A quick myth to kill now: schema-on-read is not “no schema.” A schema still gets applied — it just happens at read time, decided by the consumer, instead of being enforced at write time.

The warehouse: clean, governed, historically expensive

Think Teradata, Oracle, the classic on-premises appliance. Structured tables, modeled in advance, queried with SQL, locked down with governance. The pain was the architecture: storage and compute were tightly coupled. Need more query horsepower? You also paid for more storage, and vice versa. That made warehouses powerful but expensive and inflexible.

The cloud warehouses re-architected exactly that coupling. Snowflake splits into three independent layers — centralized storage, elastic compute (“virtual warehouses”), and cloud services — so compute scales on its own. Google BigQuery is serverless and separates its Dremel execution engine from its Colossus storage. Amazon Redshift RA3 nodes use “Redshift Managed Storage” to push cold data to S3 while caching the hot part locally. Same idea everywhere: stop welding the two together.

The lake: cheap, flexible, and prone to becoming a swamp

A data lake stores raw files of any type on cheap object storage — Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). The earliest lakes ran on Hadoop’s HDFS. The term itself was coined around 2010-2011 by James Dixon of Pentaho, contrasting a lake with a tidy little “data mart.”

The promise is real: store anything, store it cheaply, decide what it means later. But with no governance and no enforced structure, a lake left to rot becomes the thing every data engineer dreads — a data swamp: millions of undocumented files nobody can make sense of, with no transactions, no reliable schema, and no easy path to clean BI. The swamp is the canonical failure mode of an ungoverned lake, and it is precisely the gap the lakehouse was built to close.

File format vs table format — the distinction everything hinges on

Here is the single most-confused point in this whole topic, so go slow.

Parquet and ORC are FILE formats. A file format defines how the bytes of one file are laid out — Parquet and ORC store data columnar; Avro stores it row-by-row. That is all a file format does. It has no concept of a transaction across many files, no updates or deletes, no snapshots, no schema history. It is, quite literally, just files in a folder.

Delta Lake, Apache Iceberg, and Apache Hudi are TABLE formats. A table format is a thin metadata and transaction-log layer that sits on top of the data files (usually Parquet) and lets an engine treat a whole directory of files as one transactional table. That layer is what buys you:

  • ACID transactions — concurrent writes that don’t corrupt readers.
  • Time travel — query the table as it looked at an earlier snapshot.
  • Schema enforcement and evolution — reject bad writes; add columns safely.
  • Performance — skip irrelevant files using indexed statistics, instead of listing every file in the folder.

The mechanics differ. Delta Lake records each commit as a JSON entry in a _delta_log directory (periodically compacted into Parquet checkpoints), where each entry lists the files added or removed and the schema at that point. Iceberg keeps a tree of metadata — a metadata file pointing to snapshots, pointing to manifest lists, pointing to manifest files that index the data — which lets a query planner prune files at petabyte scale instead of doing a slow Hive-style “list the directory, then filter” scan.

The origins are worth knowing because they explain the design goals: Iceberg came out of Netflix (Ryan Blue and Daniel Weeks, to fix Hive metastore limits), Hudi out of Uber (the name is “Hadoop Upserts Deletes and Incrementals,” built for near-real-time upserts), and Delta Lake out of Databricks (transactional reliability for Spark on object storage). All three are open and donated to foundations.

The lakehouse: lake storage + a table format = warehouse guarantees

Now the payoff. A lakehouse keeps the lake’s cheap object storage and adds the table-format layer on top — and that combination gives you warehouse-grade behavior (ACID, governance, BI performance) directly on the files. The headline consequence: BI and ML run on ONE copy of the data instead of copying it out of a lake into a separate warehouse.

The term was popularized by Databricks in the 2021 CIDR paper “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics” (Armbrust, Ghodsi, Xin, Zaharia). Its argument: build on open formats like Parquet, support ML and data science as first-class citizens, and still hit warehouse-class performance — addressing warehouse pain points like staleness, reliability, total cost of ownership, and lock-in.

Use the explorer to see the three architectures morph. Watch what the Lakehouse tab does that the Lake tab can’t: it slots a table-format layer between the same cheap storage and the consumers, and the property checklist flips from red to green.

The aha the widget is built to deliver: a lakehouse does not need a special new storage system. It uses the same S3/ADLS/GCS as a plain lake — the warehouse-like guarantees come from the table format’s metadata, not from new hardware.

Organizing the lake: the medallion idea, briefly

Once you have a transactional lakehouse, a common convention for arranging tables is the medallion (a.k.a. “multi-hop”) architecture, promoted by Databricks. Data flows through three logical layers:

  • Bronze — raw landed data, source structure as-is, plus load metadata.
  • Silver — cleaned, conformed, deduplicated; the enterprise view.
  • Gold — business-level aggregates and features, ready for BI and ML.

Treat medallion as a naming convention for layers, not a product or a hard rule. It is just a tidy way to refine data step by step inside the lakehouse.

Where this stands in 2025-26

The landscape has converged on two open table formats — Apache Iceberg and Delta Lake — and the interesting action has moved up to the catalog layer. A few verified proof points:

  • AWS launched Amazon S3 Tables (fully managed Iceberg) at re:Invent in December 2024 — the first cloud object store with native managed Iceberg.
  • Snowflake made Iceberg tables GA and open-sourced its catalog as Apache Polaris; BigQuery added managed Iceberg via BigLake Metastore.
  • Databricks acquired Tabular (Iceberg’s original creators) for over a billion dollars in June 2024 and now natively supports both Delta and Iceberg, pushing Delta Lake UniForm so a single Parquet table can be read as Delta or Iceberg.

Engines increasingly read these formats directly, in place — Spark, Flink, Trino/Presto, Dremio, DuckDB, Athena, plus Snowflake and BigQuery reading Iceberg — so you rarely need to copy data into a warehouse just to query it.

Two honest hedges. First, don’t crown a winner: as of 2025-26 both formats are widely used. Reported surveys suggest Iceberg leads new adoption and vendor-neutrality while Delta keeps the larger installed base (Databricks plus Microsoft Fabric, which defaults to Delta) — and the convergence story (UniForm, shared Iceberg REST catalogs) is Databricks’ stated direction, not a finished fact. Second, the catalog battle — Apache Polaris (Iceberg-focused, ASF-governed) vs Databricks Unity Catalog (open-sourced, multi-format, speaks the Iceberg REST API) — is still live, and cross-engine reads, while real, still have rough edges.

Quick check

Quick check

0/3
Q1Your colleague says: 'We turned our data lake into a lakehouse, so we had to migrate everything off S3 onto a special new storage engine.' What's wrong with this?
Q2Which statement correctly distinguishes a file format from a table format?
Q3TRANSFER: A retail team lands raw clickstream JSON into S3 with no schema, building a feature store later. Six months on, two jobs writing concurrently keep corrupting each other's reads, and there's no way to reproduce yesterday's exact table for an audit. With minimal disruption, what's the most targeted fix?

Next

You now have the storage map: warehouse (governed, schema-on-write), lake (cheap, schema-on-read), and lakehouse (the lake plus a table format that buys back the guarantees). Next we look inside how those engines actually read columnar Parquet fast — and why predicate pushdown makes a filter free.

Practice this in an interview

All questions
What are the differences between a data warehouse, a data lake, and a data lakehouse?

A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.

Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.

What is the difference between OLTP and OLAP workloads, and how does that drive database design choices?

OLTP systems handle many small, latency-sensitive transactions that read and write a few rows at a time, so they are optimized for fast point lookups and row-level locking. OLAP systems run infrequent but wide analytical queries over millions of rows, so they benefit from columnar storage, bulk scans, and denormalized schemas that minimize joins.

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content