datarekha

Dimensional Modeling: Star & Snowflake

How analytical tables are actually shaped — facts vs dimensions, declaring the grain, and why the star schema's denormalized dimensions beat a normalized one for BI. Star vs snowflake, surrogate keys, and the Kimball rule of thumb.

9 min read Beginner SQL Lesson 20 of 27

What you'll learn

  • The two table types every warehouse is built from — facts and dimensions
  • Why "declaring the grain" is the first and most important modeling decision
  • The star schema — denormalized dimensions, one join each, built for BI
  • The snowflake schema — normalized dimensions, fewer redundancies, more joins
  • Why warehouses use integer surrogate keys instead of natural business keys

Before you start

Your production database is normalized to death — and rightly so. Every fact lives in exactly one place, updates are cheap, and nothing contradicts itself. But the moment an analyst asks “revenue by region by product category by month,” that beautiful normalized schema becomes a twelve-join nightmare. Dimensional modeling is the discipline of re-shaping data for questions instead of transactions — and almost every warehouse, dashboard, and cube you’ll ever touch is built on its two ideas: facts and dimensions.

Two kinds of tables: facts and dimensions

Dimensional modeling (popularized by Ralph Kimball in the 1990s) says every analytical table is one of exactly two types.

A fact table records measurements of events — what happened, as numbers. One row per event (a sale, a click, a shipment). It is tall and thin: millions or billions of rows, but only a few columns — some numeric measures (quantity, amount, discount) plus foreign keys pointing at the dimensions. Facts are where the math happens: you SUM, COUNT, and AVG them.

A dimension table holds the descriptive context — the who, what, where, and when. One row per thing (a product, a customer, a store, a date). It is short and wide: relatively few rows, but many descriptive text columns you filter and group by (product_name, category, brand, color). Dimensions are where the labels live.

fact_salesquantity, amount + keysdim_datedim_customerdim_productdim_store
A star schema: one central fact table surrounded by the dimensions that describe it.

A simple test: if you’d SUM it, it’s probably a measure in a fact table. If you’d GROUP BY it, it’s probably an attribute in a dimension. Revenue is a fact; the product category you slice it by is a dimension.

The grain: declare it before anything else

Before you add a single column, you answer one question: what does one row of the fact table mean? That is the grain, and Kimball’s first rule is declare the grain first.

“One row per order line” is a fine grain. So is “one row per order” or “one row per product per store per day.” But you must pick one and state it, because the grain decides everything downstream: which dimensions can attach, what a COUNT(*) means, and whether your sums are trustworthy.

The star schema: denormalized dimensions, one join each

The default dimensional design is the star schema: the fact table in the middle, each dimension hanging directly off it by a foreign key — the shape literally looks like a star.

The defining move is denormalization. A dim_product row carries everything about the product in one flat, wide table — product_name, category, subcategory, brand, department, all side by side — even though category and department repeat across thousands of products. That redundancy is deliberate. It means “revenue by department” is a single join: fact_sales ⋈ dim_product. Fewer joins, simpler SQL, and a shape that BI tools and columnar engines chew through fast.

The snowflake schema: normalize the dimensions

A snowflake schema takes those wide dimensions and normalizes them into sub-tables. Instead of department sitting on every product row, dim_product points to a dim_category, which points to a dim_department. The dimension “branches” outward like a snowflake.

The trade is exactly the one normalization always makes: you remove redundancy (each department name stored once) at the cost of more tables and more joins. “Revenue by department” is no longer one join — it’s fact ⋈ product ⋈ category ⋈ department, three hops out.

Flip between the two and watch the join count move:

TryStar ↔ Snowflake

One fact table, two ways to arrange the dimensions

Flip the layout and watch dim_product and dim_store normalise into branch tables. The highlighted path is the join for "total revenue by department" — count the hops.

2 tables1 join
dim_datedim_customerdim_productdim_storefact_salesmeasures + keys

Star: every dimension is one wide, denormalised table — department is a column on dim_product, so the query is a single join. Simple and fast.

Which one should you use?

Kimball’s guidance, and the modern default, is star. Here’s why the snowflake’s tidiness usually loses:

  • Storage is cheap; joins are expensive. The redundant text in a star costs a little disk. The extra joins in a snowflake cost query time and complexity on every analytical query — the hot path.
  • Columnar engines erase the downside. Warehouses like BigQuery, Snowflake (the product), and Redshift store columns compressed, so a repeated department string barely costs anything (you’ll see exactly why in the columnar-storage lesson).
  • Star is easier for humans and BI tools. One join, obvious shape, fewer ways to get it wrong.

Snowflaking earns its keep in narrow cases — a genuinely huge dimension where the redundancy is real money, or a sprawling hierarchy you must maintain in one place. Otherwise: keep it flat.

Surrogate keys: why the fact points at 42, not "SKU-9981"

One detail you’ll see everywhere: dimension tables get a surrogate key — a meaningless auto-incrementing integer (product_key = 42) — and the fact table stores that, not the natural business key ("SKU-9981").

Why bother? Three reasons. Integer joins are faster and smaller than joining on long text keys. Surrogates insulate the warehouse from the source system renumbering or reusing its IDs. And — the big one — surrogate keys are what let a dimension track history: when a product’s category changes, you can keep the old version and the new one as two rows with two different surrogate keys, while the natural key stays the same. That technique is the whole next lesson.

Quick check

Quick check

0/3
Q1What distinguishes a fact table from a dimension table?
Q2Why does a star schema deliberately denormalize its dimensions (e.g. storing 'department' on every product row)?
Q3TRANSFER: A team declares their fact_sales grain as 'one row per order line.' Later someone adds per-order shipping fees as extra rows in the same table. What breaks?

Next

You now have the warehouse’s blueprint: a fact table at a declared grain, ringed by denormalized dimensions, joined on surrogate keys. But dimensions aren’t frozen — customers move, products get recategorized, sales reps change territory. How do you record a measurement against yesterday’s version of a dimension without rewriting history? That’s slowly changing dimensions — next.

FAQCommon questions

Questions about this lesson

What is the difference between a fact table and a dimension table?

A fact table stores numeric measurements of events — one row per event (a sale, a click), with measures you SUM and foreign keys to dimensions. A dimension table stores descriptive context — one row per thing (a product, a customer) with text attributes you filter and GROUP BY. Quick test: if you would SUM it, it is a fact; if you would GROUP BY it, it is a dimension.

What is the grain of a fact table?

The grain is the precise definition of what one row of the fact table represents — for example, one row per order line, or one row per product per store per day. Kimball's first rule is to declare the grain before adding any columns, because it determines which dimensions can attach and whether your aggregates are correct. Mixing two grains in one table makes every SUM and COUNT unreliable.

Star schema vs snowflake schema — which should I use?

Use a star schema by default. A star keeps each dimension as one wide, denormalized table so every query is a single join per dimension; a snowflake normalizes dimensions into sub-tables, removing redundancy at the cost of more joins. Storage is cheap and columnar engines compress the redundancy away, so the star's simpler, faster joins almost always win.

Practice this in an interview

All questions
What is the difference between a star schema and a snowflake schema in dimensional modeling?

A star schema has a central fact table joined directly to denormalized dimension tables — one join hop per dimension, simple queries, better query performance. A snowflake schema normalizes dimension tables into sub-dimensions, reducing storage redundancy but requiring more joins. Star schemas are preferred for analytics workloads; snowflake schemas are sometimes used when a dimension is very large and has many redundant attribute values.

What is the difference between a star schema and a snowflake schema, and which should you choose?

A star schema has a central fact table joined directly to denormalized dimension tables, giving simple two-table joins and fast query performance at the cost of some data redundancy. A snowflake schema normalizes dimensions into sub-dimension tables, reducing storage and update anomalies but requiring more joins that can slow analytical queries.

Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.

What are 1NF, 2NF, and 3NF, and when would you intentionally denormalize?

1NF eliminates repeating groups and requires atomic column values. 2NF further removes partial dependencies on a composite key. 3NF removes transitive dependencies — every non-key column must depend on the key, the whole key, and nothing but the key. Denormalization trades update anomalies for read performance, and is appropriate when the read path dominates and write correctness can be enforced at the application layer or with materialized views.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content