Star vs Snowflake Schemas
How a data warehouse organises facts and dimensions for fast analytics — and why one design has fewer joins but uses more storage.
What you'll learn
- Why warehouses (OLAP) are designed differently from transactional databases (OLTP)
- Fact tables hold measurements; dimension tables hold the context
- Star schema — denormalised dimensions, fewer joins, faster reads, more storage
- Snowflake schema — normalised dimensions, more joins, slower reads, less redundancy
- The query / storage trade-off and how GATE phrases it
Before you start
Your company runs an online store. Every sale goes into a transactional database — a row inserted, an inventory count updated, a receipt printed. Now the analytics team wants to know “total sales by product category, by state, by month, for the last five years.” Asking the live transactional database to answer that means a giant scan that competes with checkout traffic.
So we copy the history into a separate store — a data warehouse — that is designed for reads, not writes. The shape of the tables changes too. Two patterns dominate: the star schema and the snowflake schema.
OLTP vs OLAP — the warehouse exists for a reason
- OLTP (Online Transaction Processing) — the live database. Optimised for small, fast inserts and updates: “record one sale.” Tables are highly normalised to avoid redundancy.
- OLAP (Online Analytical Processing) — the warehouse. Optimised for big aggregate read queries: “sum sales by region and quarter.” Tables are deliberately laid out to minimise joins on read.
The warehouse stores history — old rows are kept, not overwritten — so you can compare this year with last year.
Fact and dimension tables
Every warehouse schema splits the data into two kinds of tables:
- Fact table — one row per event you measure. Numeric measures (amount, quantity, profit) plus foreign keys that point into the dimension tables.
- Dimension table — descriptive context. The Date dimension carries day, month, quarter, year; the Product dimension carries name, category, brand; the Store dimension carries city, state, region.
The fact table is usually huge (millions of rows). Dimensions are small (thousands of products, a few hundred stores, ~3650 dates for ten years).
The two layouts
- Star schema — the fact table sits at the centre; each dimension is a single denormalised table (denormalised = deliberately keeping repeated values in one wide table instead of splitting them out). To look up a product’s category, you just read another column of the Product table.
- Snowflake schema — the dimensions are further normalised into
sub-tables. The Product table no longer carries category and brand names
directly; instead it carries
category_idandbrand_idforeign keys that point into separate ProductCategory and ProductBrand tables.
The trade-off in one sentence
- Star: fewer joins per query → faster reads, but the same category name is duplicated across every product in that category → more storage.
- Snowflake: each category name lives once → less storage, but every query that needs the category name pays for extra joins → slower reads.
How GATE asks this
Usually an MCQ giving a scenario (“the analytics team prizes query speed over disk usage — which schema?”) or asking you to count joins for a query on each schema. Sometimes an MSQ listing characteristics and asking which belong to star, which to snowflake.
Worked example — joins in star vs snowflake
A query: “Total Sales amount by product Category for 2024.”
Star schema layout (1000 products, denormalised):
Sales(amount, product_id, date_id, store_id, customer_id) -- ~10M rows
Product(product_id, name, category, brand) -- 1000 rows (category sits inline)
Date(date_id, day, month, year) -- ~3650 rows
The query: Sales JOIN Product JOIN Date — 2 joins (Date for the year
filter, Product for the category). Category is just a column in Product, no
extra hop.
Snowflake schema layout (Product normalised further):
Sales(amount, product_id, date_id, store_id, customer_id)
Product(product_id, name, category_id, brand_id) -- 1000 rows
ProductCategory(category_id, category_name) -- ~20 rows
ProductBrand(brand_id, brand_name) -- ~100 rows
Date(date_id, day, month, year)
Same query now: Sales JOIN Product JOIN ProductCategory JOIN Date — 3
joins. (Brand is not needed for this query, so we skip ProductBrand.) Storage
wins because each category name is stored exactly once in the 20-row
ProductCategory table instead of being repeated across 1000 product rows.
Quick check
Quick check
Practice this in an interview
All questionsA star schema has a central fact table joined directly to denormalized dimension tables — one join hop per dimension, simple queries, better query performance. A snowflake schema normalizes dimension tables into sub-dimensions, reducing storage redundancy but requiring more joins. Star schemas are preferred for analytics workloads; snowflake schemas are sometimes used when a dimension is very large and has many redundant attribute values.
A star schema has a central fact table joined directly to denormalized dimension tables, giving simple two-table joins and fast query performance at the cost of some data redundancy. A snowflake schema normalizes dimensions into sub-dimension tables, reducing storage and update anomalies but requiring more joins that can slow analytical queries.
Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.
A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.