datarekha

Slowly Changing Dimensions

When a customer moves or a product is recategorized, do you overwrite the past or remember it? SCD Type 1, 2, and 3 explained — the surrogate-key versioning trick behind Type 2, why it's the warehouse workhorse, and how facts point at the right version automatically.

9 min read Intermediate SQL Lesson 21 of 27

What you'll learn

  • The one question that defines SCD — keep the history or overwrite it
  • Type 1 (overwrite), Type 2 (new versioned row), Type 3 (previous-value column)
  • How surrogate keys + valid_from/valid_to/is_current make Type 2 work
  • Why a fact row automatically links to the dimension version that was current at event time
  • Where the stream of changes comes from, and the MERGE that applies it

Before you start

A dimension is supposed to describe a thing — a customer, a product, a store. But things change. A customer moves city. A product gets recategorized. A sales rep is reassigned to a new territory. The question that defines slowly changing dimensions (SCD) is deceptively small: when the description changes, do you overwrite the past, or remember it? Your answer changes the table’s shape, not just its data.

The whole decision: history or not

Every SCD strategy is an answer to “what happens to the old value?” There are three you’ll meet constantly.

Pick a type below and apply the move — notice the columns change, not just the cells. That’s the tell: choosing an SCD type is a schema decision you make up front.

TrySCD Types

Maria moves from Austin to Denver — three ways to record it

Pick an SCD type, then apply the move. Watch how the dimension's columns and rows change — that's the whole decision.

dim_customerfull history
keyidnamecityvalid_fromvalid_tocurrent
1C1ArunSeattle2020-01-01true
2C2MariaAustin2020-01-01true
3C3WeiBoston2020-01-01true
3 rows

Type 2 — New row. The old row is expired (valid_to set, current → false) and a brand-new row with a new surrogate key is inserted. Facts before the move still point at key 2; new facts point at key 4. History is preserved.

Type 1 — overwrite (forget the past)

The simplest: UPDATE dim_customer SET city = 'Denver' WHERE id = 'C2'. One row, no extra columns, no history. The catch is that the past is rewritten — last quarter’s reports, re-run today, now say Denver as if Maria had always lived there.

That’s exactly right for corrections (a misspelled name, a fat-fingered ZIP) where the old value was simply wrong. It’s exactly wrong when the old value was true at the time and someone might analyze that history. Use Type 1 when nobody will ever ask “what was it back then?”

Type 2 — add a new row (the workhorse)

Type 2 is the one you’ll use most. Instead of overwriting, you expire the old row and insert a new one:

  • The existing row gets valid_to = '2024-03-01' and is_current = false.
  • A brand-new row is inserted with a new surrogate key, the new city, valid_from = '2024-03-01', valid_to = NULL, is_current = true.

The natural key (C2) stays the same across both rows; the surrogate key is what differs. And here’s the elegant part: a fact recorded in January stored customer_key = 2 (the version current then), so it permanently points at “Austin Maria.” A sale in April stores customer_key = 4 and points at “Denver Maria.” History reconstructs itself — each fact already references the dimension version that was true when the event happened. This is why surrogate keys exist.

dim_customer (Type 2)key 2 · C2 · Maria · Austinvalid_to 2024-03-01 · current falsekey 4 · C2 · Maria · Denvervalid_from 2024-03-01 · current truefacts point at the right versionJan sale → customer_key 2Apr sale → customer_key 4
Two rows, two surrogate keys, one natural key. Each fact already references the version that was current when it happened.

Type 3 — a previous-value column (one step back)

Type 3 adds a column: current_city and previous_city. It remembers exactly one prior value — useful for “compare this year’s region to last year’s” — but it can’t hold a third city, and it doesn’t tell you when the change happened. It’s a niche tool; Type 2 is the general answer. (There’s also a Type 6 — the name is a nod to 1+2+3 — that combines a current-value column on top of Type 2 rows, so you get full history and a fast “current” lookup at once.)

Where the changes come from, and how they’re applied

Two practical questions remain. How do you learn a row changed? In a modern stack that stream of inserts/updates/deletes usually arrives via Change Data Capture — the next lesson. How do you apply it? With a single MERGE (upsert): match on the natural key, expire the matched current row, and insert the new version — all in one atomic statement. That’s the same MERGE you met in Delta Lake, and SCD Type 2 is its canonical use.

Quick check

Quick check

0/3
Q1A customer's mailing address was entered with a typo and is simply wrong. Which SCD type fits?
Q2In SCD Type 2, what makes a January fact still point at the customer's January attributes after they move in March?
Q3TRANSFER: After a botched load, your Type 2 dim_customer has TWO rows for customer C2 both with is_current = true. What went wrong, and what's the symptom?

Next

You can now model a warehouse and version its history. But all of this assumes the data is even in the warehouse, stored in a way that scans fast. Next we drop below the logical model to the physical one: columnar storage and Parquet — why an analytical engine can sum a billion-row fact table while reading almost none of it.

FAQCommon questions

Questions about this lesson

What are slowly changing dimensions (SCD)?

Slowly changing dimensions are techniques for handling changes to dimension attributes over time — when a customer moves city or a product is recategorized. The core question is whether to overwrite the old value or preserve it. The common strategies are Type 1 (overwrite), Type 2 (add a new versioned row), and Type 3 (keep one previous value in an extra column).

What is the difference between SCD Type 1 and Type 2?

Type 1 overwrites the attribute in place, so history is lost and past reports change — fine for corrections. Type 2 expires the old row (setting valid_to and is_current = false) and inserts a new row with a new surrogate key, preserving full history so facts keep pointing at the version that was current when they happened. Type 2 is the warehouse workhorse.

Why do SCD Type 2 dimensions use a surrogate key?

Because Type 2 stores multiple rows for the same real-world entity (same natural key, different versions), each version needs its own unique key — the surrogate key. A fact recorded at a given time stores the surrogate key of the version current then, so historical facts automatically reference historical attributes without any rewriting.

Practice this in an interview

All questions
What are slowly changing dimensions, and how do Type 1 and Type 2 differ?

Slowly changing dimensions (SCDs) handle attributes that change over time — a customer moving cities, a product changing category. Type 1 overwrites the old value and loses history. Type 2 inserts a new row with effective and expiry dates, preserving the full history of what was true at any point in time. Type 2 is the standard when accurate historical reporting matters.

How do you handle schema evolution in data pipelines without breaking downstream consumers?

Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include: only adding nullable columns (backwards-compatible), using schema registries to enforce compatibility rules before a producer publishes, and open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.

What is Change Data Capture (CDC) and how is it implemented?

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.

Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content