dataarchitect.studio

Field Notes

Slowly Changing Dimensions, Explained Without the Jargon

“Slowly changing dimensions” is an intimidating name for a simple idea. A dimension describes context — a customer, a product, a store. Over time, that context changes: the customer moves city, the product gets recategorised, the store switches region. A slowly changing dimension, or SCD, is just a strategy for what happens to history when one of those attributes changes. That’s the entire concept. The types — 1, 2, 3 — are only different answers to one question: do you keep the old value, or overwrite it?

If you’ve read the field guide to dimensional modeling, you’ve met this idea in passing. Here’s the full version.

The question every dimension eventually asks

Say you have a customer dimension, and the customer C-1077 was in the “Small Business” segment when they placed an order last year. This year they’ve grown and been moved to “Enterprise.” Someone now runs a report: sales by segment, last year. Should C-1077’s old orders count under “Small Business” (what they were at the time) or “Enterprise” (what they are now)?

There is no universally correct answer — only a business decision. SCD types are the menu of ways to implement whichever answer your business needs.

An SCD type is a technical setting that encodes a business choice: when reality changes, does our history change with it, or stay as it was?

Type 1 — Overwrite (forget the past)

The simplest strategy: when an attribute changes, you just update the value in place. The old value is gone.

UPDATE dim_customer
   SET segment = 'Enterprise'
 WHERE customer_id = 'C-1077';

Now all of C-1077’s orders — past and future — report under “Enterprise,” because that’s the only value the dimension remembers. Last year’s “sales by segment” will retroactively change.

Use Type 1 when history doesn’t matter for that attribute — correcting a misspelled name, fixing a data-entry error, or tracking a value where only the current state is ever meaningful. It’s cheap and easy. The cost is amnesia: you can never again ask what things looked like before.

Type 2 — Add a new row (keep full history)

This is the important one, the strategy people usually mean when they say “SCD.” On change, you don’t overwrite — you expire the old row and insert a new one. The customer now exists as two rows: the historical version and the current version, each valid for a span of time.

dim_customer
+-------------+-------------+------------+------------+------------+------------+
| customer_key| customer_id | segment    | valid_from | valid_to   | is_current |
+-------------+-------------+------------+------------+------------+------------+
| 4401        | C-1077      | Small Bus. | 2024-02-01 | 2026-03-14 | false      |
| 8902        | C-1077      | Enterprise | 2026-03-15 | 9999-12-31 | true       |
+-------------+-------------+------------+------------+------------+------------+

Each fact (each order) points to the customer_key that was current when the order happened. Last year’s orders reference key 4401 (“Small Business”); this year’s reference 8902 (“Enterprise”). History stays honest — “sales by segment, last year” returns what was actually true then, and it never changes.

Two things make Type 2 work, and both are worth noticing. First, the validity columns (valid_from, valid_to, is_current) are what let two versions of “the same” customer coexist and be queried by date. Second — and this is why I keep linking them — Type 2 is only possible because of a surrogate key. The natural key C-1077 is identical on both rows; it’s the meaningless surrogate (4401 vs 8902) that distinguishes the versions and gives facts something stable to point at. Without a surrogate key, you simply cannot represent history this way.

The cost of Type 2 is that the dimension grows over time and your pipeline gets more complex — it has to detect changes, expire old rows, and insert new ones correctly. For most analytics where history matters, that complexity is worth paying.

Type 3 — Add a column (keep limited history)

A middle option, used far less often. Instead of a new row, you keep both the old and new value side by side in columns:

customer_id | current_segment | previous_segment | segment_changed_on
C-1077      | Enterprise      | Small Business   | 2026-03-15

This preserves exactly one step of history — you can see the current and the immediately prior value, but nothing further back. Type 3 fits the narrow case where you care about “before and after” a single known transition (a re-org, a re-branding) and don’t need full history. It’s a specialist tool; reach for Type 2 when you want history in general, and Type 3 only for this specific shape of question.

Which to use

The choice is per-attribute, not per-table — a single dimension can mix strategies. A customer’s segment might be Type 2 (you want historical accuracy for reporting), while a typo correction in their name is Type 1 (no one wants to preserve the misspelling). Decide attribute by attribute:

  • Type 1 — you only ever care about the current value, or you’re fixing an error.
  • Type 2 — you need to report on history as it actually was. This is the workhorse; when in doubt, this is usually the right default for attributes that carry analytical meaning.
  • Type 3 — you need exactly one prior value around a specific, known change.

The trap to avoid

The classic mistake isn’t choosing the wrong type — it’s not choosing at all. Teams default to silent Type 1 (overwriting) simply because that’s what a plain UPDATE does, and only discover the problem months later when someone asks for a historical breakdown and finds the past has been quietly rewritten. By then the old values are gone and unrecoverable.

So make the decision deliberately, attribute by attribute, before the data starts flowing. That act — deciding what history means before you need it — is exactly the kind of small, deliberate choice that separates a designed dimension from one that merely accumulated. Slowly changing dimensions aren’t arcane. They’re just the place where you decide, on purpose, what your data is allowed to forget.