Field Notes
Normalization vs Denormalization: When Each Wins
Normalization and denormalization are opposite answers to the same question: should data live in many small related tables, or fewer wide ones? Normalization splits data apart to eliminate redundancy, which is ideal for systems that write constantly. Denormalization deliberately combines it, accepting redundancy to eliminate joins, which is ideal for systems that mostly read. Neither is “correct” — the right choice follows your workload, and getting it backwards is a common, expensive mistake. Here’s the trade-off and how it decides your warehouse design.
At a glance
| Normalized | Denormalized | |
|---|---|---|
| Structure | Many small related tables | Fewer, wider tables |
| Redundancy | Minimal — each fact stored once | Accepted — values repeat |
| Joins to query | Many | Few or none |
| Write/update speed | Fast, update in one place | Slower, update many copies |
| Read speed | Slower (join cost) | Faster (no joins) |
| Integrity | Strong by design | Must be maintained |
| Best for | Transactional systems (OLTP) | Analytics & reporting (OLAP) |
What normalization is
Normalization organizes data into multiple related tables so each fact is stored
exactly once. A customer’s address lives in one row of a customers table; an order
references that customer by key rather than copying the address into every order. The
goal is to eliminate redundancy, and the payoff is write efficiency and integrity:
update the address in one place and every order reflects it instantly, with no chance
of conflicting copies. This is why transactional
systems — which are all about fast, correct, concurrent
writes — are heavily normalized. It’s the right shape when the workload is changing
data.
What denormalization is
Denormalization deliberately does the opposite: it combines related data into wider tables, accepting repetition, to eliminate joins. The customer’s name and region get copied directly onto each order row, even though that repeats across thousands of orders. You’re trading storage and update-efficiency for one thing: read speed and simplicity. A query that would have joined five normalized tables now reads one, which is faster and far easier to write.
A worked example
Ask for revenue by customer region. Normalized, the region sits in a separate
customers table, so you join to reach it:
-- NORMALIZED: join out to get region
SELECT c.region, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
Denormalized, region is already on each order row, so there’s no join at all:
-- DENORMALIZED: region is right there
SELECT region, SUM(amount) AS revenue
FROM orders_wide
GROUP BY region;
The second is faster and simpler to read — but region now repeats on every order, and
if a customer’s region changes you must update every one of their rows. That’s the
trade in miniature: fewer joins, more redundancy.
The trade-off, and how to choose
It comes down to write-efficiency versus read-efficiency. Normalize when data changes often and integrity is paramount — store each fact once so updates are cheap and safe. Denormalize when data is read far more than it’s written and query speed matters — accept redundancy so reads avoid joins. The danger of denormalization is exactly the redundancy: a repeated value updated in one place but not another creates inconsistency, which is why it suits read-mostly data far better than churny transactional data.
Why warehouses denormalize
This is the key insight for analytics: a data warehouse is overwhelmingly read-heavy — written by a handful of pipelines, queried by everyone — so it should optimize for reads, which means denormalizing. That’s the entire reason dimensional models and star schemas are denormalized: dimensions are flattened into wide tables so analytical queries need fewer joins, and the redundancy is safe because the data is loaded, not constantly edited. Take denormalization to its limit and you arrive at one big table; pull a star back toward normalization and you get a snowflake.
So the rule of thumb writes itself. Transactional system, changing constantly → normalize. Analytical system, read constantly → denormalize. Match the shape of the data to the shape of the workload, and most “should I normalize this?” debates answer themselves.
Common questions
What is the difference between normalization and denormalization?
Normalization splits data into multiple related tables to eliminate redundancy, so each fact is stored once. Denormalization deliberately combines data into fewer, wider tables, accepting redundancy to eliminate joins. Normalization optimizes for writes and integrity; denormalization optimizes for read speed and simplicity.
Is a star schema normalized or denormalized?
Denormalized. A star schema flattens each dimension into a single wide table rather than splitting it into related sub-tables, precisely so analytical queries need fewer joins. Normalizing those dimensions instead turns a star schema into a snowflake schema.
When should you denormalize?
When the workload is read-heavy and join-heavy — analytics and reporting — and read performance and query simplicity matter more than storage or update efficiency. Data warehouses are denormalized for exactly this reason; transactional systems usually stay normalized.
Does denormalization cause data inconsistency?
It can. Because a value is repeated across many rows, updating it means updating every copy, and missing one creates inconsistency. That's why denormalization suits read-mostly analytical data — written once by a pipeline — far better than frequently updated transactional data.
Essays by email
One new essay on data architecture, straight to your inbox. No noise, unsubscribe anytime.