datarekha
Data Engineering Medium Asked at SnowflakeAsked at BigQueryAsked at DatabricksAsked at ClickHouseAsked at Redshift

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

The short answer

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.

How to think about it

Columnar storage and partitioning are the two most impactful performance levers in a modern warehouse — understanding them helps you write efficient queries and design tables that stay fast as data grows.

Columnar storage

In a row store, a row’s columns are adjacent on disk. In a columnar store (Parquet, ORC, Arrow), each column is written contiguously.

Why this matters for analytics:

A query like SELECT SUM(revenue) FROM fact_orders on a 100-column table only needs one column. Row storage reads all 100 columns to extract one. Columnar storage reads 1/100th the data.

Columnar formats also compress extremely well because adjacent values in a column have high correlation:

  • Dictionary encoding — repeated strings (country codes, status values) stored as integer codes.
  • Run-length encoding — consecutive identical values collapsed to a count + value pair.
  • Delta encoding — sorted integers (timestamps) stored as small differences.

A Parquet file for a status column with 95% completed values can be 50x smaller than raw.

Partitioning

Partitioning divides a table into physical subdirectories by the value of a partition column.

s3://my-bucket/orders/
  order_date=2026-06-01/  ← partition
    part-0001.parquet
  order_date=2026-06-02/
    part-0002.parquet

A query with WHERE order_date = '2026-06-05' only opens the order_date=2026-06-05/ directory. The engine never touches the other 364 days of data. This is called partition pruning.

-- With date partition: reads 1 directory out of 365
SELECT SUM(revenue) FROM fact_orders
WHERE order_date = '2026-06-05';

Clustering and Z-ordering

Partitioning works for low-cardinality columns (date, region). For high-cardinality filter columns (customer_id, product_id), use clustering keys (Snowflake) or Z-ORDER (Delta Lake) to colocate related rows within files, enabling file-level pruning via min/max statistics stored in Parquet footers.

-- Snowflake: cluster by customer_id for frequent per-customer queries
ALTER TABLE fact_orders CLUSTER BY (customer_id);
Learn it properly Columnar Storage & Parquet

Keep practising

All Data Engineering questions

Explore further

Skip to content