Slowly Changing Dimensions
When a customer moves or a product is recategorized, do you overwrite the past or remember it? SCD Type 1, 2, and 3 explained — the surrogate-key versioning trick behind Type 2, why it's the warehouse workhorse, and how facts point at the right version automatically.
What you'll learn
- The one question that defines SCD — keep the history or overwrite it
- Type 1 (overwrite), Type 2 (new versioned row), Type 3 (previous-value column)
- How surrogate keys + valid_from/valid_to/is_current make Type 2 work
- Why a fact row automatically links to the dimension version that was current at event time
- Where the stream of changes comes from, and the MERGE that applies it
Before you start
A dimension is supposed to describe a thing — a customer, a product, a store. But things change. A customer moves city. A product gets recategorized. A sales rep is reassigned to a new territory. The question that defines slowly changing dimensions (SCD) is deceptively small: when the description changes, do you overwrite the past, or remember it? Your answer changes the table’s shape, not just its data.
The whole decision: history or not
Every SCD strategy is an answer to “what happens to the old value?” There are three you’ll meet constantly.
Pick a type below and apply the move — notice the columns change, not just the cells. That’s the tell: choosing an SCD type is a schema decision you make up front.
Maria moves from Austin to Denver — three ways to record it
Pick an SCD type, then apply the move. Watch how the dimension's columns and rows change — that's the whole decision.
| key | id | name | city | valid_from | valid_to | current |
|---|---|---|---|---|---|---|
| 1 | C1 | Arun | Seattle | 2020-01-01 | — | true |
| 2 | C2 | Maria | Austin | 2020-01-01 | — | true |
| 3 | C3 | Wei | Boston | 2020-01-01 | — | true |
Type 2 — New row. The old row is expired (valid_to set, current → false) and a brand-new row with a new surrogate key is inserted. Facts before the move still point at key 2; new facts point at key 4. History is preserved.
Type 1 — overwrite (forget the past)
The simplest: UPDATE dim_customer SET city = 'Denver' WHERE id = 'C2'.
One row, no extra columns, no history. The catch is that the past is
rewritten — last quarter’s reports, re-run today, now say Denver as if
Maria had always lived there.
That’s exactly right for corrections (a misspelled name, a fat-fingered ZIP) where the old value was simply wrong. It’s exactly wrong when the old value was true at the time and someone might analyze that history. Use Type 1 when nobody will ever ask “what was it back then?”
Type 2 — add a new row (the workhorse)
Type 2 is the one you’ll use most. Instead of overwriting, you expire the old row and insert a new one:
- The existing row gets
valid_to = '2024-03-01'andis_current = false. - A brand-new row is inserted with a new surrogate key, the new city,
valid_from = '2024-03-01',valid_to = NULL,is_current = true.
The natural key (C2) stays the same across both rows; the surrogate
key is what differs. And here’s the elegant part: a fact recorded in
January stored customer_key = 2 (the version current then), so it
permanently points at “Austin Maria.” A sale in April stores
customer_key = 4 and points at “Denver Maria.” History reconstructs
itself — each fact already references the dimension version that was
true when the event happened. This is why surrogate keys exist.
Type 3 — a previous-value column (one step back)
Type 3 adds a column: current_city and previous_city. It remembers
exactly one prior value — useful for “compare this year’s region to
last year’s” — but it can’t hold a third city, and it doesn’t tell you
when the change happened. It’s a niche tool; Type 2 is the general
answer. (There’s also a Type 6 — the name is a nod to 1+2+3 — that
combines a current-value column on top of Type 2 rows, so you get full
history and a fast “current” lookup at once.)
Where the changes come from, and how they’re applied
Two practical questions remain. How do you learn a row changed? In a
modern stack that stream of inserts/updates/deletes usually arrives via
Change Data Capture — the next lesson. How do you apply it? With a
single MERGE (upsert): match on the natural key, expire the matched
current row, and insert the new version — all in one atomic statement.
That’s the same MERGE you met in Delta Lake, and SCD Type 2 is its
canonical use.
Quick check
Quick check
Next
You can now model a warehouse and version its history. But all of this assumes the data is even in the warehouse, stored in a way that scans fast. Next we drop below the logical model to the physical one: columnar storage and Parquet — why an analytical engine can sum a billion-row fact table while reading almost none of it.
Questions about this lesson
What are slowly changing dimensions (SCD)?
Slowly changing dimensions are techniques for handling changes to dimension attributes over time — when a customer moves city or a product is recategorized. The core question is whether to overwrite the old value or preserve it. The common strategies are Type 1 (overwrite), Type 2 (add a new versioned row), and Type 3 (keep one previous value in an extra column).
What is the difference between SCD Type 1 and Type 2?
Type 1 overwrites the attribute in place, so history is lost and past reports change — fine for corrections. Type 2 expires the old row (setting valid_to and is_current = false) and inserts a new row with a new surrogate key, preserving full history so facts keep pointing at the version that was current when they happened. Type 2 is the warehouse workhorse.
Why do SCD Type 2 dimensions use a surrogate key?
Because Type 2 stores multiple rows for the same real-world entity (same natural key, different versions), each version needs its own unique key — the surrogate key. A fact recorded at a given time stores the surrogate key of the version current then, so historical facts automatically reference historical attributes without any rewriting.
Practice this in an interview
All questionsSlowly changing dimensions (SCDs) handle attributes that change over time — a customer moving cities, a product changing category. Type 1 overwrites the old value and loses history. Type 2 inserts a new row with effective and expiry dates, preserving the full history of what was true at any point in time. Type 2 is the standard when accurate historical reporting matters.
Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include: only adding nullable columns (backwards-compatible), using schema registries to enforce compatibility rules before a producer publishes, and open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.
CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.