dataarchitect.studio

Field Notes

The Grain of a Fact Table: The First Decision That Decides Everything Else

The grain of a fact table is the business definition of what exactly one row represents. Not the list of columns — a sentence: one row per order line, one row per account per month, one row per boarding-pass scan. Declaring that sentence is the first design decision in dimensional modeling, made before you pick a single dimension or measure, because it is the test every later choice must pass. A dimension belongs in the table only if it’s single-valued at that grain; a measure belongs only if it’s true at that grain. Skip the declaration and both mistakes walk in unnoticed — which is why nearly every double-counting bug in a warehouse traces back to a grain violation.

Kimball’s design method makes this order explicit: choose the business process, declare the grain, then — and only then — choose dimensions and facts. The grain sits second because everything downstream inherits from it.

Atomic grain first, always

The safest declaration is the most atomic one — the finest level of detail the source’s measurement event actually produces. For a retail order that’s the order line, not the order; for web traffic, the page view, not the session.

  Atomic grain (order line) Coarse grain (order)
“Sales by product?” Yes — product is a line attribute Impossible — products vary within an order
Unanticipated questions Survives them Fails the first new one
Roll up / drill down Aggregates up freely Can never drill back down
Row volume Larger Smaller
Right role The foundation fact table A derived aggregate on top

The asymmetry is the whole argument: atomic data can always become coarse, coarse data can never become atomic again. If query cost hurts, add an aggregate table (or let a semantic layer manage one) — as a second table with its own declared grain, not a compromise in the first.

The classic bug: two grains in one table

Here’s how the mistake usually arrives. Orders have lines (product, quantity, price) — but the source also carries an order-level shipping fee. Someone helpfully adds shipping_fee to the line-grain table, repeated on every line:

-- Grain: one row per order line ... except one column isn't
SELECT
  order_id, product_key, quantity,
  line_amount,     -- true at line grain ✓
  shipping_fee     -- true at ORDER grain ✗ (repeated on every line)
FROM fact_order_line;

-- The inevitable report, quietly wrong:
SELECT sum(line_amount) AS revenue,
       sum(shipping_fee) AS shipping   -- 3-line orders count fees 3x
FROM fact_order_line;

The fix is never “remember not to sum that column.” It’s one of two structural moves:

order-level fact arrives (shipping fee, at the wrong grain) Fix 1 — allocate down split the fee across lines table stays at line grain Fix 2 — second fact table fact_order, one row per order its own declared grain sums are safe at every grain each table sums correctly alone what never works: storing both grains in one table and hoping analysts remember
Two honest fixes for a fact at the wrong grain — allocation, or a separate fact table.
-- Fix 1: allocate the order fact down to line grain
shipping_fee * line_amount / sum(line_amount) OVER (PARTITION BY order_id)
  AS allocated_shipping_fee

Allocation keeps one table and makes every column summable; a second fact table keeps each measurement honest at its natural grain. Both work. Hoping people remember doesn’t.

Grain declares the fact table’s type

Say the grain sentence out loud and you’ve usually also named which of the three fact table types you’re building: one row per event is a transaction fact, one row per thing per period is a periodic snapshot, one row per process lifecycle is an accumulating snapshot. Even factless fact tables — rows with no measures at all — still have a perfectly crisp grain; the row itself is the fact.

The grain also disciplines dimensions. Ask of each candidate: is it single-valued at this grain? Product is single-valued per order line but multivalued per order — so a line grain earns the product dimension and an order grain forfeits it. That one test, applied ruthlessly, is most of what separates a star schema that holds up from a wide table that slowly stops being true.

Declare the grain in one sentence. Write it at the top of the model file. Reject anything that violates it. It is the cheapest correctness guarantee in all of data architecture.

Common questions

What is the grain of a fact table?

The grain is the precise business definition of what a single row in the fact table represents — for example, 'one row per order line' or 'one row per account per month.' It's declared in business terms, before choosing dimensions or measures, and every element of the table must honor it.

How do I choose the right grain?

Start at the most atomic grain the source system produces — the individual measurement event, like an order line or a sensor reading. Atomic grain answers unpredictable questions and can always be aggregated up; a coarser grain can never be drilled back down. Only pre-aggregate as a deliberate performance layer on top.

Can one fact table have two grains?

No — mixing grains in one table is the classic dimensional modeling bug. If a business process produces measurements at two grains (order lines and order-level shipping fees, say), either allocate the coarser fact down to the fine grain or build two fact tables. Never store both grains in the same table.

What's the difference between grain and granularity?

In dimensional modeling they're used interchangeably, but 'grain' is the Kimball term of art for the declared meaning of one row. 'Granularity' more loosely describes how detailed data is. The discipline lies in the declaration: writing the grain down as a sentence and rejecting anything that violates it.