Columnar Storage & Parquet
Why an analytical engine can sum a billion-row table while reading almost none of it. Row vs column layout, how Parquet's row groups and column chunks enable column pruning and predicate pushdown, and why storing columns together compresses far better.
What you'll learn
- Why analytics is column-shaped — few columns, many rows — and OLTP is row-shaped
- How columnar layout enables column pruning (read only the columns you select)
- The anatomy of a Parquet file — row groups, column chunks, pages, and the footer
- Predicate pushdown — skipping whole row groups using min/max statistics
- Why columns of similar values compress dramatically better than rows
Before you start
The last lesson left a promise: a lakehouse stores cheap Parquet files but queries them at warehouse speed — how? The answer is one of the most leveraged ideas in all of data engineering, and it comes down to a single choice about which direction you write the bytes. Store your table by row and analytics crawls. Store it by column and the same engine can sum a billion rows while physically reading a tiny sliver of the file.
Two ways to lay a table on disk
A table is a 2-D grid, but storage is 1-D — a flat sequence of bytes. So you must choose an order.
A row-oriented store (every OLTP database) writes each row’s fields
together: [1,West,Bolt,5,4,20][2,East,Nut,8,2,16]…. Perfect when you
fetch or update whole records one at a time — grab order #5071, all its
columns are adjacent.
A column-oriented store writes each column’s values together:
[1,2,3,…][West,East,West,…][Bolt,Nut,Gear,…]…. Now a whole column is one
contiguous run.
Analytical queries almost never want whole rows. SELECT SUM(amount) ... GROUP BY region touches 2 of 20 columns over every row. On a row
store, those two columns are smeared across the whole file — to read them
you drag every other column through the disk and CPU. On a column store,
they’re two tidy runs you read directly and the other 18 columns are never
touched. That’s column pruning, and it’s the first win.
Try it — the same query, three layouts:
The same query, three storage layouts
Query: SELECT SUM(amount) WHERE qty >= 5. Only two columns matter. Switch the layout and watch how many cells the engine must actually read — green is read, faded is skipped.
| id | region | product | qtyfilter | price | amountsum |
|---|---|---|---|---|---|
| 1 | West | Bolt | 5 | 4 | 20 |
| 2 | East | Nut | 8 | 2 | 16 |
| 3 | West | Gear | 6 | 9 | 54 |
| 4 | North | Pin | 7 | 3 | 21 |
| row group A · qty min 5, max 8 | |||||
| 5 | East | Cap | 2 | 5 | 10 |
| 6 | South | Rod | 3 | 7 | 21 |
| 7 | East | Clip | 1 | 2 | 2 |
| 8 | North | Tab | 4 | 6 | 24 |
| row group B · qty min 1, max 4 | |||||
Columnar: each column is stored together, so the engine reads only the qty and amount chunks and ignores the other four — column pruning.
Inside a Parquet file
Parquet is the de-facto open columnar format (Apache, born at Twitter and Cloudera around 2013). A file is not just “columns” — it has a precise nested structure built for exactly the skipping you saw above:
- A file is split into row groups — horizontal slabs of, say, ~1 million rows each.
- Within a row group, each column is stored as a column chunk (all that column’s values for those rows, together).
- A column chunk is divided into pages — the smallest unit of encoding and compression.
- A footer at the end holds the schema plus per-column-chunk statistics: min, max, null count, and sometimes a dictionary.
Predicate pushdown: skipping data you never read
Because the footer carries min/max stats per column chunk, a reader can
evaluate a WHERE filter against the statistics before touching the
data. If you ask for WHERE qty >= 5 and a row group’s qty column chunk
has max = 4, that group cannot contain a match — so the engine skips
it entirely, never decompressing a single page. That’s predicate
pushdown (a.k.a. row-group skipping), the second win you saw in the
visualizer: column pruning cuts the columns, pushdown cuts the rows.
This is also why sort order matters. If a table is sorted (or roughly clustered) by the column you filter on, each row group’s min/max range is narrow and non-overlapping, so pushdown can prune almost everything. Random order means every group’s range overlaps your filter and nothing gets skipped.
Why columns compress so well
The third win is compression. A column holds values of one type with
similar values — a country column is a sea of repeated "US", a
status column cycles through five strings, a sorted date column barely
changes row to row. That uniformity is a compressor’s dream:
- Dictionary encoding replaces repeated values with small integer
codes (
"US"→0), storing the dictionary once. - Run-length encoding (RLE) collapses
US,US,US,USinto “US ×4.” - A general codec (Snappy, Zstd, gzip) then squeezes the result.
A row store can’t do this nearly as well — adjacent bytes are a jumble of different types (an int, then a string, then a float), so the patterns compression feeds on are broken up. Columnar files routinely land 5–10× smaller, and smaller files mean less I/O, which means faster queries on top of everything else.
Quick check
Quick check
Next
You can model a warehouse, version its history, and store it so queries read almost nothing. The remaining question is how data gets in and stays fresh. The classic answer is batch ETL/ELT; the modern, low-latency one is Change Data Capture — streaming every insert, update, and delete out of a source database the instant it commits. That’s next.
Questions about this lesson
Why is columnar storage faster than row storage for analytics?
Analytical queries touch a few columns over many rows. A columnar layout stores each column contiguously, so the engine reads only the columns the query references (column pruning) and skips the rest, while a row store smears those columns across every record. Columns of similar values also compress 5–10x better, cutting I/O further.
What is predicate pushdown in Parquet?
Predicate pushdown uses the min/max statistics Parquet stores per row group in its footer to skip data without reading it. If a query filters WHERE qty >= 5 and a row group's qty column has max = 4, that group cannot contain a match, so the engine skips it entirely — also called row-group skipping. Sorting the data by the filter column makes pushdown far more effective.
What is a row group in a Parquet file?
A row group is a horizontal slab of a Parquet file — roughly a million rows — within which each column is stored together as a column chunk, subdivided into pages. The file footer records the schema plus per-column-chunk statistics (min, max, null count). Row groups are the unit that predicate pushdown skips, and column chunks are what column pruning reads selectively.
Practice this in an interview
All questionsColumnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.
Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.
Partitioning divides a large table into smaller physical segments (partitions) based on a column value, so the planner can skip irrelevant partitions entirely — a technique called partition pruning. It improves performance for queries that filter on the partition key, and it simplifies bulk data management tasks like dropping old data by dropping a partition instead of issuing a slow DELETE.
Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.