Dimensional Modeling: Star & Snowflake
How analytical tables are actually shaped — facts vs dimensions, declaring the grain, and why the star schema's denormalized dimensions beat a normalized one for BI. Star vs snowflake, surrogate keys, and the Kimball rule of thumb.
What you'll learn
- The two table types every warehouse is built from — facts and dimensions
- Why "declaring the grain" is the first and most important modeling decision
- The star schema — denormalized dimensions, one join each, built for BI
- The snowflake schema — normalized dimensions, fewer redundancies, more joins
- Why warehouses use integer surrogate keys instead of natural business keys
Before you start
Your production database is normalized to death — and rightly so. Every fact lives in exactly one place, updates are cheap, and nothing contradicts itself. But the moment an analyst asks “revenue by region by product category by month,” that beautiful normalized schema becomes a twelve-join nightmare. Dimensional modeling is the discipline of re-shaping data for questions instead of transactions — and almost every warehouse, dashboard, and cube you’ll ever touch is built on its two ideas: facts and dimensions.
Two kinds of tables: facts and dimensions
Dimensional modeling (popularized by Ralph Kimball in the 1990s) says every analytical table is one of exactly two types.
A fact table records measurements of events — what happened, as
numbers. One row per event (a sale, a click, a shipment). It is tall and
thin: millions or billions of rows, but only a few columns — some
numeric measures (quantity, amount, discount) plus foreign
keys pointing at the dimensions. Facts are where the math happens: you
SUM, COUNT, and AVG them.
A dimension table holds the descriptive context — the who, what,
where, and when. One row per thing (a product, a customer, a store, a
date). It is short and wide: relatively few rows, but many descriptive
text columns you filter and group by (product_name, category, brand,
color). Dimensions are where the labels live.
A simple test: if you’d SUM it, it’s probably a measure in a fact
table. If you’d GROUP BY it, it’s probably an attribute in a
dimension. Revenue is a fact; the product category you slice it by is a
dimension.
The grain: declare it before anything else
Before you add a single column, you answer one question: what does one row of the fact table mean? That is the grain, and Kimball’s first rule is declare the grain first.
“One row per order line” is a fine grain. So is “one row per order”
or “one row per product per store per day.” But you must pick one and
state it, because the grain decides everything downstream: which
dimensions can attach, what a COUNT(*) means, and whether your sums are
trustworthy.
The star schema: denormalized dimensions, one join each
The default dimensional design is the star schema: the fact table in the middle, each dimension hanging directly off it by a foreign key — the shape literally looks like a star.
The defining move is denormalization. A dim_product row carries
everything about the product in one flat, wide table —
product_name, category, subcategory, brand, department, all
side by side — even though category and department repeat across
thousands of products. That redundancy is deliberate. It means
“revenue by department” is a single join: fact_sales ⋈ dim_product.
Fewer joins, simpler SQL, and a shape that BI tools and columnar engines
chew through fast.
The snowflake schema: normalize the dimensions
A snowflake schema takes those wide dimensions and normalizes them
into sub-tables. Instead of department sitting on every product row,
dim_product points to a dim_category, which points to a
dim_department. The dimension “branches” outward like a snowflake.
The trade is exactly the one normalization always makes: you remove
redundancy (each department name stored once) at the cost of more
tables and more joins. “Revenue by department” is no longer one join —
it’s fact ⋈ product ⋈ category ⋈ department, three hops out.
Flip between the two and watch the join count move:
One fact table, two ways to arrange the dimensions
Flip the layout and watch dim_product and dim_store normalise into branch tables. The highlighted path is the join for "total revenue by department" — count the hops.
Star: every dimension is one wide, denormalised table — department is a column on dim_product, so the query is a single join. Simple and fast.
Which one should you use?
Kimball’s guidance, and the modern default, is star. Here’s why the snowflake’s tidiness usually loses:
- Storage is cheap; joins are expensive. The redundant text in a star costs a little disk. The extra joins in a snowflake cost query time and complexity on every analytical query — the hot path.
- Columnar engines erase the downside. Warehouses like BigQuery,
Snowflake (the product), and Redshift store columns compressed, so a
repeated
departmentstring barely costs anything (you’ll see exactly why in the columnar-storage lesson). - Star is easier for humans and BI tools. One join, obvious shape, fewer ways to get it wrong.
Snowflaking earns its keep in narrow cases — a genuinely huge dimension where the redundancy is real money, or a sprawling hierarchy you must maintain in one place. Otherwise: keep it flat.
Surrogate keys: why the fact points at 42, not "SKU-9981"
One detail you’ll see everywhere: dimension tables get a surrogate
key — a meaningless auto-incrementing integer (product_key = 42) — and
the fact table stores that, not the natural business key ("SKU-9981").
Why bother? Three reasons. Integer joins are faster and smaller than joining on long text keys. Surrogates insulate the warehouse from the source system renumbering or reusing its IDs. And — the big one — surrogate keys are what let a dimension track history: when a product’s category changes, you can keep the old version and the new one as two rows with two different surrogate keys, while the natural key stays the same. That technique is the whole next lesson.
Quick check
Quick check
Next
You now have the warehouse’s blueprint: a fact table at a declared grain, ringed by denormalized dimensions, joined on surrogate keys. But dimensions aren’t frozen — customers move, products get recategorized, sales reps change territory. How do you record a measurement against yesterday’s version of a dimension without rewriting history? That’s slowly changing dimensions — next.
Questions about this lesson
What is the difference between a fact table and a dimension table?
A fact table stores numeric measurements of events — one row per event (a sale, a click), with measures you SUM and foreign keys to dimensions. A dimension table stores descriptive context — one row per thing (a product, a customer) with text attributes you filter and GROUP BY. Quick test: if you would SUM it, it is a fact; if you would GROUP BY it, it is a dimension.
What is the grain of a fact table?
The grain is the precise definition of what one row of the fact table represents — for example, one row per order line, or one row per product per store per day. Kimball's first rule is to declare the grain before adding any columns, because it determines which dimensions can attach and whether your aggregates are correct. Mixing two grains in one table makes every SUM and COUNT unreliable.
Star schema vs snowflake schema — which should I use?
Use a star schema by default. A star keeps each dimension as one wide, denormalized table so every query is a single join per dimension; a snowflake normalizes dimensions into sub-tables, removing redundancy at the cost of more joins. Storage is cheap and columnar engines compress the redundancy away, so the star's simpler, faster joins almost always win.
Practice this in an interview
All questionsA star schema has a central fact table joined directly to denormalized dimension tables — one join hop per dimension, simple queries, better query performance. A snowflake schema normalizes dimension tables into sub-dimensions, reducing storage redundancy but requiring more joins. Star schemas are preferred for analytics workloads; snowflake schemas are sometimes used when a dimension is very large and has many redundant attribute values.
A star schema has a central fact table joined directly to denormalized dimension tables, giving simple two-table joins and fast query performance at the cost of some data redundancy. A snowflake schema normalizes dimensions into sub-dimension tables, reducing storage and update anomalies but requiring more joins that can slow analytical queries.
Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.
1NF eliminates repeating groups and requires atomic column values. 2NF further removes partial dependencies on a composite key. 3NF removes transitive dependencies — every non-key column must depend on the key, the whole key, and nothing but the key. Denormalization trades update anomalies for read performance, and is appropriate when the read path dominates and write correctness can be enforced at the application layer or with materialized views.