Columnar Storage & Parquet

Why an analytical engine can sum a billion-row table while reading almost none of it. Row vs column layout, how Parquet's row groups and column chunks enable column pruning and predicate pushdown, and why storing columns together compresses far better.

9 min read Intermediate SQL Lesson 23 of 27

What you'll learn

Why analytics is column-shaped — few columns, many rows — and OLTP is row-shaped
How columnar layout enables column pruning (read only the columns you select)
The anatomy of a Parquet file — row groups, column chunks, pages, and the footer
Predicate pushdown — skipping whole row groups using min/max statistics
Why columns of similar values compress dramatically better than rows

Before you start

Warehouse, Lake & Lakehouse SQL · lesson OLTP vs OLAP SQL · lesson Python section

The last lesson left a promise: a lakehouse stores cheap Parquet files but queries them at warehouse speed — how? The answer is one of the most leveraged ideas in all of data engineering, and it comes down to a single choice about which direction you write the bytes. Store your table by row and analytics crawls. Store it by column and the same engine can sum a billion rows while physically reading a tiny sliver of the file.

Two ways to lay a table on disk

A table is a 2-D grid, but storage is 1-D — a flat sequence of bytes. So you must choose an order.

A row-oriented store (every OLTP database) writes each row’s fields together: [1,West,Bolt,5,4,20][2,East,Nut,8,2,16]…. Perfect when you fetch or update whole records one at a time — grab order #5071, all its columns are adjacent.

A column-oriented store writes each column’s values together: [1,2,3,…][West,East,West,…][Bolt,Nut,Gear,…]…. Now a whole column is one contiguous run.

Analytical queries almost never want whole rows. SELECT SUM(amount) ... GROUP BY region touches 2 of 20 columns over every row. On a row store, those two columns are smeared across the whole file — to read them you drag every other column through the disk and CPU. On a column store, they’re two tidy runs you read directly and the other 18 columns are never touched. That’s column pruning, and it’s the first win.

Picture that same query over a six-column table:

That is column pruning in one picture: the engine reads the two columns the query names and never opens the other four. A row store cannot do this — those two columns are smeared across every record, so reaching them drags the whole table through disk and CPU.

Inside a Parquet file

Parquet is the de-facto open columnar format (Apache, born at Twitter and Cloudera around 2013). A file is not just “columns” — it has a precise nested structure built for exactly the skipping you saw above:

A file is split into row groups — horizontal slabs of, say, ~1 million rows each.
Within a row group, each column is stored as a column chunk (all that column’s values for those rows, together).
A column chunk is divided into pages — the smallest unit of encoding and compression.
A footer at the end holds the schema plus per-column-chunk statistics: min, max, null count, and sometimes a dictionary.

Predicate pushdown: skipping data you never read

Because the footer carries min/max stats per column chunk, a reader can evaluate a WHERE filter against the statistics before touching the data. If you ask for WHERE qty >= 5 and a row group’s qty column chunk has max = 4, that group cannot contain a match — so the engine skips it entirely, never decompressing a single page. That’s predicate pushdown (a.k.a. row-group skipping), the second win you saw in the visualizer: column pruning cuts the columns, pushdown cuts the rows.

This is also why sort order matters. If a table is sorted (or roughly clustered) by the column you filter on, each row group’s min/max range is narrow and non-overlapping, so pushdown can prune almost everything. Random order means every group’s range overlaps your filter and nothing gets skipped.

Why columns compress so well

The third win is compression. A column holds values of one type with similar values — a country column is a sea of repeated "US", a status column cycles through five strings, a sorted date column barely changes row to row. That uniformity is a compressor’s dream:

Dictionary encoding replaces repeated values with small integer codes ("US" → 0), storing the dictionary once.
Run-length encoding (RLE) collapses US,US,US,US into “US ×4.”
A general codec (Snappy, Zstd, gzip) then squeezes the result.

A row store can’t do this nearly as well — adjacent bytes are a jumble of different types (an int, then a string, then a float), so the patterns compression feeds on are broken up. Columnar files routinely land 5–10× smaller, and smaller files mean less I/O, which means faster queries on top of everything else.

Quick check

0/3

Q1Why is a column-oriented layout faster than a row-oriented one for SELECT SUM(amount) GROUP BY region over a 50-column table?

Q2A Parquet row group's footer says its qty column chunk has min=12, max=40. Your query filters WHERE qty >= 5. What can the engine conclude?

Q3TRANSFER: A pipeline currently lands analytics data as gzipped CSV and queries are slow and expensive. Suggest the highest-leverage storage change and why it helps without touching the SQL.

You can model a warehouse, version its history, and store it so queries read almost nothing. The remaining question is how data gets in and stays fresh. The classic answer is batch ETL/ELT; the modern, low-latency one is Change Data Capture — streaming every insert, update, and delete out of a source database the instant it commits. That’s next.

FAQCommon questions

Questions about this lesson

Why is columnar storage faster than row storage for analytics?

Analytical queries touch a few columns over many rows. A columnar layout stores each column contiguously, so the engine reads only the columns the query references (column pruning) and skips the rest, while a row store smears those columns across every record. Columns of similar values also compress 5–10x better, cutting I/O further.

What is predicate pushdown in Parquet?

Predicate pushdown uses the min/max statistics Parquet stores per row group in its footer to skip data without reading it. If a query filters WHERE qty >= 5 and a row group's qty column has max = 4, that group cannot contain a match, so the engine skips it entirely — also called row-group skipping. Sorting the data by the filter column makes pushdown far more effective.

What is a row group in a Parquet file?

A row group is a horizontal slab of a Parquet file — roughly a million rows — within which each column is stored together as a column chunk, subdivided into pages. The file footer records the schema plus per-column-chunk statistics (min, max, null count). Row groups are the unit that predicate pushdown skips, and column chunks are what column pruning reads selectively.

Practice this in an interview

All questions

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.

What is table partitioning and when does it improve query performance?

Partitioning divides a large table into smaller physical segments (partitions) based on a column value, so the planner can skip irrelevant partitions entirely — a technique called partition pruning. It improves performance for queries that filter on the partition key, and it simplifies bulk data management tasks like dropping old data by dropping a partition instead of issuing a slow DELETE.

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.

Explore further

Glossary terms

Columnar Storage Parquet Partitioning Lakehouse

Cheat sheets

SQL NumPy Pandas

Columnar Storage & Parquet

What you'll learn

Before you start

Two ways to lay a table on disk

Inside a Parquet file

Predicate pushdown: skipping data you never read

Why columns compress so well

Quick check

Quick check

Next

Sign in to track your progress

Questions about this lesson

Practice this in an interview

Related lessons

Explore further