Dimensional Modeling: Star & Snowflake

How analytical tables are actually shaped — facts vs dimensions, declaring the grain, and why the star schema's denormalised dimensions beat a normalised one for BI.

9 min read Beginner SQL Lesson 20 of 27

What you'll learn

The two table types every warehouse is built from — facts and dimensions
Why 'declaring the grain' is the first and most important modeling decision
Star (denormalised, one join each) versus snowflake (normalised, more joins)
Why warehouses use integer surrogate keys instead of natural business keys

Before you start

OLTP vs OLAP (SQL lesson)
INNER JOIN (SQL lesson)
Python (section)

Your production database is normalised to within an inch of its life — and rightly so. Every fact lives in exactly one place, updates are cheap, and nothing can contradict itself.

Then an analyst asks for “revenue by region, by product category, by month.” Suddenly that beautiful schema is a six-join nightmare across seven tables (orders, order lines, products, categories, stores, regions, a date lookup): slow to write, slow to run, easy to get wrong. Dimensional modeling is the craft of re-shaping data for questions instead of transactions, and almost every warehouse and dashboard you will meet is built on its two ideas: facts and dimensions.

Two kinds of table

Dimensional modeling — popularised by Ralph Kimball in the 1990s — says every analytical table is one of exactly two types. A fact table records measurements of events: one row per thing that happened (a sale, a click, a shipment). It is tall and thin — millions of rows but few columns: a handful of numeric measures (quantity, amount) plus foreign keys pointing at the dimensions. Facts are where the maths happens; you SUM and COUNT and AVG them. A dimension table holds the descriptive context — the who, what, where, and when. It is short and wide: relatively few rows, but many text columns you filter and group by (product_name, category, brand). Dimensions are where the labels live.

A star schema: one central fact table surrounded by the dimensions that describe it.

A quick test sorts almost anything: if you would SUM it, it is a measure in a fact; if you would GROUP BY it, it is an attribute in a dimension. Revenue is a measure in the fact table; the product category you slice it by is an attribute of the product dimension.
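To make the SUM versus GROUP BY test concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product), columns, and sample rows are illustrative assumptions, not a schema from a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: short and wide, descriptive attributes you filter and group by.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT,
        brand        TEXT
    );
    -- Fact: tall and thin, numeric measures plus foreign keys to dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
    -- Hypothetical sample data.
    INSERT INTO dim_product VALUES
        (1, 'Espresso Beans', 'Coffee', 'Acme'),
        (2, 'Green Tea',      'Tea',    'Acme'),
        (3, 'Decaf Beans',    'Coffee', 'Bolt');
    INSERT INTO fact_sales VALUES
        (1, 2, 20.0), (1, 1, 10.0), (2, 3, 15.0), (3, 1, 12.0);
""")

# SUM a measure from the fact, GROUP BY an attribute from the dimension.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(revenue_by_category)  # [('Coffee', 42.0), ('Tea', 15.0)]
```

The measure (amount) never leaves the fact table and the label (category) never leaves the dimension; the join on product_key is the only thing connecting them.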

Declare the grain first

Before you add a single column, answer one question: what does one row of the fact table mean? That is the grain, and Kimball’s first rule is to declare it before anything else. “One row per order line” is a fine grain; so is “one row per order” or “one row per product per store per day.” But you must pick one and state it, because the grain decides everything downstream — which dimensions can attach, what a COUNT(*) means, and whether your sums can be trusted.
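A small sketch of why the grain matters, again with sqlite3 and made-up data. The fact table is declared as one row per order line; the second step deliberately breaks that by adding order-level total rows, an assumed mistake chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Declared grain: one row per order line.
    CREATE TABLE fact_sales (
        order_id INTEGER,
        line_no  INTEGER,
        quantity INTEGER,
        amount   REAL
    );
    INSERT INTO fact_sales VALUES
        (100, 1, 2, 20.0),
        (100, 2, 1,  5.0),
        (101, 1, 4, 40.0);
""")

def count_and_revenue(connection):
    """Return (row count, total amount) for the whole fact table."""
    return connection.execute(
        "SELECT COUNT(*), SUM(amount) FROM fact_sales"
    ).fetchone()

clean = count_and_revenue(conn)  # every row is an order line

# Grain violation: someone appends one row per ORDER holding the order total.
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, NULL, ?, ?)",
    [(100, 3, 25.0), (101, 4, 40.0)],
)
mixed = count_and_revenue(conn)  # rows now mean two different things

print(clean)  # (3, 65.0)
print(mixed)  # (5, 130.0)
```

After the second insert, COUNT(*) no longer counts order lines and SUM(amount) double-counts revenue, even though every individual row is "correct" on its own.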

Star versus snowflake

The default design is the star schema: the fact table in the middle, each dimension hanging directly off it by a foreign key — the shape literally looks like a star. Its defining move is denormalisation. A dim_product row carries everything about the product flat and wide — product_name, category, subcategory, brand, department, side by side — even though category and department repeat across thousands of products. That redundancy is deliberate, because it makes “revenue by department” a single join.

A snowflake schema instead normalises those wide dimensions into sub-tables: dim_product points to dim_category, which points to dim_department, branching outward like a snowflake. It is the trade normalisation always makes — less redundancy (each department name stored once) at the cost of more tables and more joins. Now “revenue by department” is no longer one hop but fact ⋈ product ⋈ category ⋈ department, three.
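The difference in join count can be sketched side by side. The example below builds both shapes in an in-memory SQLite database; the _star and _snow suffixes, the key columns, and the sample rows are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflake: product -> category -> department, each name stored once.
    CREATE TABLE dim_department (
        department_key INTEGER PRIMARY KEY, department TEXT);
    CREATE TABLE dim_category (
        category_key INTEGER PRIMARY KEY, category TEXT,
        department_key INTEGER REFERENCES dim_department(department_key));
    CREATE TABLE dim_product_snow (
        product_key INTEGER PRIMARY KEY, product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key));

    -- Star: the same information flattened onto every product row.
    CREATE TABLE dim_product_star (
        product_key INTEGER PRIMARY KEY, product_name TEXT,
        category TEXT, department TEXT);

    CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

    INSERT INTO dim_department VALUES (1, 'Grocery'), (2, 'Home');
    INSERT INTO dim_category VALUES
        (10, 'Coffee', 1), (11, 'Tea', 1), (12, 'Lighting', 2);
    INSERT INTO dim_product_snow VALUES
        (1, 'Espresso Beans', 10), (2, 'Green Tea', 11), (3, 'Desk Lamp', 12);
    INSERT INTO dim_product_star VALUES
        (1, 'Espresso Beans', 'Coffee',   'Grocery'),
        (2, 'Green Tea',      'Tea',      'Grocery'),
        (3, 'Desk Lamp',      'Lighting', 'Home');
    INSERT INTO fact_sales VALUES (1, 30.0), (2, 15.0), (3, 50.0), (1, 10.0);
""")

# Star: one join reaches the department.
star = conn.execute("""
    SELECT p.department, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product_star AS p ON p.product_key = f.product_key
    GROUP BY p.department ORDER BY p.department
""").fetchall()

# Snowflake: three joins to reach the same attribute.
snowflake = conn.execute("""
    SELECT d.department, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product_snow AS p ON p.product_key = f.product_key
    JOIN dim_category     AS c ON c.category_key = p.category_key
    JOIN dim_department   AS d ON d.department_key = c.department_key
    GROUP BY d.department ORDER BY d.department
""").fetchall()

print(star)       # [('Grocery', 55.0), ('Home', 50.0)]
print(snowflake)  # [('Grocery', 55.0), ('Home', 50.0)]
```

Both queries return the same answer; the only difference is how many tables the query author and the engine have to walk through to get it.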

So which wins? Kimball’s guidance, and the modern default, is star — for three reasons. Storage is cheap and joins are expensive: the star’s redundant text costs a little disk, while the snowflake’s extra joins cost query time on every analytical query, the hot path. Columnar engines erase the downside anyway, compressing a repeated department string down to almost nothing. And a star is simply easier for humans and BI tools to get right. Snowflaking earns its keep only in narrow cases — a genuinely enormous dimension where the redundancy is real money. Otherwise, keep it flat.
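To get a feel for why repeated strings compress so well, here is a toy dictionary encoding in plain Python. It is only a sketch of the general idea: real columnar engines use their own, more elaborate encodings, and the column contents and row counts below are invented.

```python
# Hypothetical dim_product.department column: 10,000 rows, 3 distinct values.
departments = ["Grocery"] * 6000 + ["Electronics"] * 3000 + ["Apparel"] * 1000

# Dictionary encoding: store each distinct string once, then a small code per row.
dictionary = sorted(set(departments))
code_of = {name: code for code, name in enumerate(dictionary)}
codes = bytes(code_of[name] for name in departments)  # one byte per row

# Compare the raw text size with dictionary plus codes.
raw_bytes = sum(len(name.encode("utf-8")) for name in departments)
encoded_bytes = sum(len(name.encode("utf-8")) for name in dictionary) + len(codes)

# Decoding recovers the original column exactly.
decoded = [dictionary[code] for code in codes]

print(raw_bytes, encoded_bytes)  # 82000 10025
```

Even this naive scheme shrinks the column roughly eightfold, which is the intuition behind treating a star schema's redundant text as cheap.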

Surrogate keys

One detail you will see everywhere: a dimension gets a surrogate key — a meaningless auto-incrementing integer (product_key = 42) — and the fact stores that, not the natural business key ("SKU-9981"). Three reasons. Integer joins are faster and smaller than joining on long text. Surrogates insulate the warehouse from the source system renumbering or reusing its ids. And — the big one — they are what let a dimension track history: when a product’s category changes, you keep the old version and the new one as two rows with two surrogate keys while the natural key stays the same. That technique is the whole next lesson.
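As a rough preview of that idea, the sketch below stores two versions of one product under the same natural key. The sku column, the product name, and the category values are hypothetical, and a real history-tracking dimension would typically carry more bookkeeping columns than shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        sku          TEXT,                               -- natural business key
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        amount      REAL
    );
""")

insert_product = (
    "INSERT INTO dim_product (sku, product_name, category) VALUES (?, ?, ?)"
)

# Original version of the product, and a sale recorded against it.
old_key = conn.execute(insert_product, ("SKU-9981", "Trail Mix", "Snacks")).lastrowid
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (old_key, 8.0))

# The category changes: add a new row with a new surrogate key, same SKU.
new_key = conn.execute(insert_product, ("SKU-9981", "Trail Mix", "Health Food")).lastrowid
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (new_key, 12.0))

# Each sale still reports under the category that applied when it happened.
history = conn.execute("""
    SELECT p.sku, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.sku, p.category
    ORDER BY p.category
""").fetchall()

print(old_key, new_key)  # 1 2
print(history)  # [('SKU-9981', 'Health Food', 12.0), ('SKU-9981', 'Snacks', 8.0)]
```

Because the fact stores product_key rather than the SKU, the old sale keeps pointing at the old version of the row, which would be impossible if the natural key were the join column.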

Practice

Quick check

Q1. What distinguishes a fact table from a dimension table?

Q2. Why does a star schema deliberately denormalise its dimensions (storing 'department' on every product row)?

Q3. TRANSFER: A team declares fact_sales grain as 'one row per order line.' Later someone adds per-order shipping fees as extra rows in the same table. What breaks?

FAQ: Common questions

Questions about this lesson

What is the difference between a fact table and a dimension table?

A fact table stores numeric measurements of events: one row per event (a sale, a click), with measures you SUM and foreign keys to dimensions. A dimension table stores descriptive context: one row per thing (a product, a customer), with text attributes you filter and GROUP BY. Quick test: if you would SUM it, it is a measure in a fact; if you would GROUP BY it, it is an attribute in a dimension.

What is the grain of a fact table?

The grain is the precise definition of what one row of the fact table represents — for example, one row per order line, or one row per product per store per day. Kimball's first rule is to declare the grain before adding any columns, because it determines which dimensions can attach and whether your aggregates are correct. Mixing two grains in one table makes every SUM and COUNT unreliable.

Star schema vs snowflake schema — which should I use?

Use a star schema by default. A star keeps each dimension as one wide, denormalized table so every query is a single join per dimension; a snowflake normalizes dimensions into sub-tables, removing redundancy at the cost of more joins. Storage is cheap and columnar engines compress the redundancy away, so the star's simpler, faster joins almost always win.

Practice this in an interview

Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses favor denormalization: wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas typically outperform normalized models for analytical workloads.

What is the difference between a star schema and a snowflake schema in dimensional modeling?

A star schema has a central fact table joined directly to denormalized dimension tables — one join hop per dimension, simple queries, better query performance. A snowflake schema normalizes dimension tables into sub-dimensions, reducing storage redundancy but requiring more joins. Star schemas are preferred for analytics workloads; snowflake schemas are sometimes used when a dimension is very large and has many redundant attribute values.

What are 1NF, 2NF, and 3NF, and when would you intentionally denormalize?

1NF eliminates repeating groups and requires atomic column values. 2NF further removes partial dependencies on a composite key. 3NF removes transitive dependencies, so that every non-key column depends on the key, the whole key, and nothing but the key. Denormalization accepts the risk of update anomalies in exchange for read performance, and is appropriate when the read path dominates and write correctness can be enforced at the application layer or with materialized views.

What is the difference between a star schema and a snowflake schema, and which should you choose?

A star schema has a central fact table joined directly to denormalized dimension tables, giving one simple join per dimension and fast query performance at the cost of some data redundancy. A snowflake schema normalizes dimensions into sub-dimension tables, reducing storage and update anomalies but requiring more joins that can slow analytical queries. Choose the star by default, and snowflake only when a dimension is large enough that its redundancy carries a real cost.
