<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dataarchitect.studio/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dataarchitect.studio/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-05-31T16:26:28+05:30</updated><id>https://dataarchitect.studio/feed.xml</id><title type="html">dataarchitect.studio</title><subtitle>Essays and field notes on data architecture, data modeling, dimensional modeling, data contracts, and the lakehouse — practical writing for data engineers and architects who design systems that make information trustworthy.</subtitle><author><name>dataarchitect.studio</name></author><entry><title type="html">Your AI Is Only as Good as Your Data Architecture</title><link href="https://dataarchitect.studio/essays/your-ai-is-only-as-good-as-your-data-architecture/" rel="alternate" type="text/html" title="Your AI Is Only as Good as Your Data Architecture" /><published>2026-05-31T09:30:00+05:30</published><updated>2026-05-31T09:30:00+05:30</updated><id>https://dataarchitect.studio/essays/your-ai-is-only-as-good-as-your-data-architecture</id><content type="html" xml:base="https://dataarchitect.studio/essays/your-ai-is-only-as-good-as-your-data-architecture/"><![CDATA[<p>There’s a comforting story in which generative AI makes data architecture less
important — point a clever enough model at your data and it’ll just figure things
out, schema be damned. The opposite is true. Every way an AI system touches your
data, it does so more credulously and at greater scale than any human ever did,
which means it is <em>more</em> exposed to bad structure, not less. GenAI doesn’t let you
skip the architecture. It raises the price of getting it wrong.</p>

<h2 id="the-new-consumers-dont-sanity-check">The new consumers don’t sanity-check</h2>

<p>For years, the consumers of your data were humans and dashboards. Humans have a
saving grace: when a number looks wrong, they squint at it. An analyst who sees
revenue triple overnight assumes a pipeline broke before they assume the company
did. That skepticism has quietly compensated for a lot of shaky data architecture.</p>

<p>The new consumers have no such instinct. A retrieval-augmented chatbot, an
autonomous agent, an LLM translating a question into SQL against your warehouse —
each will take whatever your data hands it and present the result with total,
fluent confidence.</p>

<blockquote>
  <p>A human analyst who gets a weird number investigates it. An LLM reports it — in a
complete sentence, with no hint that anything is off. AI removes the last layer of
human skepticism that was silently covering for bad data.</p>
</blockquote>

<p>This is the shift that matters. When the consumer stops double-checking, the
<em>structure</em> and <em>correctness</em> of the data have to carry the entire burden of trust.
That burden is exactly what data architecture exists to bear.</p>

<h2 id="rag-quality-is-retrieval-quality-is-data-quality">RAG quality is retrieval quality is data quality</h2>

<p>Take the most common pattern: retrieval-augmented generation, where a model answers
using documents pulled from your own data rather than relying only on what it learned in training. RAG is
widely treated as a model problem — pick the right embedding model, tune the prompt.
But the quality of a RAG system is dominated by the quality of <em>retrieval</em>, and
retrieval quality is dominated by the state of the underlying data.</p>

<p>If your source documents are duplicated, contradictory, stale, or unlabelled, the
retriever faithfully surfaces duplicated, contradictory, stale, unlabelled context —
and the model dutifully reasons over garbage. Many “hallucinations” aren’t the model
inventing things; they’re the model accurately summarising bad or conflicting
retrieved data. The fix in those cases isn’t a better model. It’s deduplication,
clear metadata, freshness guarantees, and source-of-truth discipline — the same
<a href="/essays/the-shape-of-data/">deliberate shape</a> the rest of your data needs. A RAG
system sits directly on top of your data architecture and inherits every weakness in
it.</p>
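<p>As a rough illustration of what “deduplication and freshness guarantees” can mean before anything reaches a retriever, here is a minimal sketch. The <code class="language-plaintext highlighter-rouge">Doc</code> fields, the one-year cutoff, and the sample documents are all assumptions made up for the example, and the fingerprint only catches copies that are identical after whitespace and case normalisation; near-duplicates would need something stronger.</p>

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str      # hypothetical label for the system that owns the document
    updated: date
    text: str


def fingerprint(text: str) -> str:
    """Hash of whitespace- and case-normalised text, used to spot exact duplicates."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def prepare_for_indexing(docs, today, max_age_days=365):
    """Drop stale documents, then keep only the newest copy of each duplicate."""
    cutoff = today - timedelta(days=max_age_days)
    newest = {}
    for doc in docs:
        if doc.updated < cutoff:
            continue  # stale: never reaches the retriever
        key = fingerprint(doc.text)
        if key not in newest or doc.updated > newest[key].updated:
            newest[key] = doc
    return sorted(newest.values(), key=lambda d: d.doc_id)


# Invented sample data: two copies of the same policy and one outdated version.
docs = [
    Doc("a1", "wiki", date(2026, 1, 10), "Refunds are issued within 14 days."),
    Doc("a2", "pdf-export", date(2026, 4, 2), "Refunds  are issued within 14 DAYS."),
    Doc("b1", "wiki", date(2022, 6, 1), "Refunds are issued within 30 days."),
]
clean = prepare_for_indexing(docs, today=date(2026, 5, 31))
print([d.doc_id for d in clean])
```

<p>Only the newest copy of the current policy survives, so the retriever cannot surface the contradictory 30-day version at all.</p>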

<h2 id="when-an-ai-queries-your-warehouse">When an AI queries your warehouse</h2>

<p>The pattern with the highest stakes is letting a model query your warehouse directly
— natural-language-to-SQL, or an agent with database access. The moment you do this,
every latent ambiguity in your data becomes a live wire.</p>

<p>Suppose three tables each define “active user” slightly differently and none is
marked canonical. A human analyst eventually learns which one to use, through
folklore and scar tissue. An LLM has no folklore. It will pick a table, write
plausible SQL, and report a number — and on the next question it may pick a different
table and report a different number, each time with the same confidence. You’ve
automated the production of inconsistent answers.</p>

<p>This is precisely why a <a href="/essays/what-is-a-semantic-layer/">semantic layer</a> goes from
nice-to-have to load-bearing the instant AI enters the picture. If “active user” and
“revenue” are defined in exactly one governed place that the AI is made to query
<em>through</em>, the model can’t improvise its own definitions. Without that layer, an AI
data interface is a confident random-number generator wearing a tie. The semantic
layer is the guardrail, and AI is what makes the guardrail non-optional.</p>
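<p>To make “queried <em>through</em>” concrete, here is a deliberately tiny sketch of the idea, not a real semantic-layer product. The registry, metric names, table names, and sample rows are all hypothetical. The point is only that the model gets to choose a metric name and a date range, never the SQL itself.</p>

```python
import sqlite3

# Hypothetical governed registry: exactly one definition per metric name.
METRICS = {
    "active_users": (
        "SELECT COUNT(DISTINCT user_id) FROM events "
        "WHERE event_date >= :start AND event_date < :end"
    ),
    "revenue": (
        "SELECT SUM(amount) FROM orders "
        "WHERE status = 'paid' AND order_date >= :start AND order_date < :end"
    ),
}


class UnknownMetric(KeyError):
    """Raised when a caller asks for a metric that has no governed definition."""


def compile_metric(name, start, end):
    """Return (sql, params) for a governed metric; refuse anything unregistered."""
    if name not in METRICS:
        raise UnknownMetric(name)
    return METRICS[name], {"start": start, "end": end}


# Invented sample data to show the governed definition being executed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (100.0, "paid", "2026-05-01"),
        (40.0, "refunded", "2026-05-02"),
        (60.0, "paid", "2026-05-20"),
    ],
)

sql, params = compile_metric("revenue", "2026-05-01", "2026-06-01")
revenue = conn.execute(sql, params).fetchone()[0]
print(revenue)
```

<p>A request for a metric that is not in the registry fails loudly instead of being improvised, which is the behaviour you want from an AI-facing interface.</p>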

<h2 id="governance-stops-being-paperwork">Governance stops being paperwork</h2>

<p>The same goes for the unglamorous disciplines that ambitious teams love to defer.
Who owns this dataset? What’s it allowed to mean? Is it fresh? Are the
<a href="/essays/data-contracts-are-a-cultural-problem/">contracts</a> between producers and
consumers actually honoured? These questions used to fail slowly and quietly — a
stale table, a mildly wrong dashboard, an annoyed analyst. Feed the same ungoverned
data to an AI system and the failure is fast, fluent, and scaled to every user who
asks. Governance was always the foundation; AI just turned the lights on and showed
everyone the cracks.</p>

<h2 id="ai-is-an-amplifier">AI is an amplifier</h2>

<p>The throughline is simple. Generative AI is an amplifier. Point it at well-structured,
well-governed, single-source-of-truth data and it amplifies that quality into fast,
trustworthy answers at a scale humans couldn’t match. Point it at the typical
accreted swamp — duplicated facts, fuzzy definitions, unowned tables — and it
amplifies <em>that</em> just as faithfully, manufacturing confident nonsense far faster than
any human could.</p>

<p>So the arrival of AI doesn’t change the data architect’s job. It changes the
<em>consequences</em> of doing it badly. The work — choosing the shape, defining the meaning,
assigning the ownership, guaranteeing the quality — is the same work it always was. It
has simply stopped being something you can get away with neglecting. (I’ve written
separately about <a href="/essays/what-genai-changes-about-data-architecture/">what GenAI actually changes versus what it
doesn’t</a>, for the fuller
accounting.)</p>

<p>The teams that win with AI won’t be the ones with the cleverest prompts. They’ll be
the ones whose data was already in deliberate shape when the AI showed up — because
an amplifier is only ever as good as the signal you feed it.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[Retrieval-augmented generation, AI agents, and LLMs querying your warehouse are all only as reliable as the data beneath them. GenAI doesn't replace data architecture — it raises the price of getting it wrong.]]></summary></entry><entry><title type="html">What GenAI Actually Changes About Data Architecture — and What It Doesn’t</title><link href="https://dataarchitect.studio/essays/what-genai-changes-about-data-architecture/" rel="alternate" type="text/html" title="What GenAI Actually Changes About Data Architecture — and What It Doesn’t" /><published>2026-05-31T08:30:00+05:30</published><updated>2026-05-31T08:30:00+05:30</updated><id>https://dataarchitect.studio/essays/what-genai-changes-about-data-architecture</id><content type="html" xml:base="https://dataarchitect.studio/essays/what-genai-changes-about-data-architecture/"><![CDATA[<p>Every few years a technology arrives that vendors insist changes everything about
data architecture, and the honest answer is always the same: it changes some things,
leaves most things alone, and the trick is telling which is which before you rebuild
your stack around a slide deck. Generative AI is the current everything-changer. So
let’s do the accounting plainly — what it genuinely changes, and what it pointedly
does not.</p>

<h2 id="what-it-actually-changes">What it actually changes</h2>

<p>Three things are real, and worth taking seriously.</p>

<p><strong>A new storage primitive: the vector.</strong> To retrieve data by <em>meaning</em> rather than by
exact match, you store numerical representations — embeddings — and search them by
similarity. That’s a genuinely new access pattern, and it’s brought vector indexes
into the stack, whether as dedicated vector databases or as vector capabilities
bolted onto databases you already run. If your applications need semantic search or
retrieval-augmented generation, this is a real addition to your architecture, not
hype.</p>
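<p>For readers who have not seen similarity search up close, this is roughly what the access pattern looks like. The three-number vectors below are hand-made stand-ins, not real embeddings, and the brute-force scan is an assumption for clarity; production vector indexes typically use approximate nearest-neighbour structures instead.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy "embeddings": invented values chosen so the geometry is easy to follow.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-faq": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

nearest = top_k(query, index, k=2)
print(nearest)
```

<p>Nothing here matches on keywords; the lookup is purely geometric, which is what makes it a different primitive from an exact-match index.</p>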

<p><strong>Unstructured data becomes first-class.</strong> For decades, the analytical stack was built
around structured, tabular data, and the piles of text, documents, images, and
transcripts mostly sat untapped. GenAI gives you a practical way to extract meaning
from that unstructured mass and put it to work. That genuinely expands what data
architecture is responsible for — the <a href="/essays/data-warehouse-vs-data-lake-vs-lakehouse/">lake and lakehouse</a>
patterns that store raw, varied data suddenly have a lot more to do.</p>

<p><strong>A new retrieval pattern and a new consumer.</strong> RAG — fetch relevant context, then let
a model reason over it — is a legitimately new shape for how data flows to an answer.
And the consumer at the end of it is new in kind: a system that reads your data and
acts on it without the human skepticism that used to be the last line of defence.</p>

<p>That’s the real list. Notice what’s on it: a new index type, a new data category
brought into scope, a new retrieval flow. Additions to the architecture. Now notice
what <em>isn’t</em> on it.</p>

<h2 id="what-it-doesnt-change">What it doesn’t change</h2>

<p>The fundamentals. All of them. And not only do they survive; they become <em>more</em>
load-bearing, not less.</p>

<p><strong>You still need deliberate structure.</strong> A vector index doesn’t absolve you of
modeling; it sits <em>alongside</em> your structured data, which still has to be shaped on
purpose. The <a href="/essays/a-field-guide-to-dimensional-modeling/">dimensional models</a> and
governed tables don’t disappear because some of your data is now embeddings. If
anything, AI consumers reading that structured data make its shape matter more.</p>

<p><strong>You still need data quality.</strong> As I’ve argued at length, <a href="/essays/your-ai-is-only-as-good-as-your-data-architecture/">a RAG system is only as
good as the data underneath it</a>.
Duplicated, stale, contradictory data produces duplicated, stale, contradictory
answers — now delivered fluently and at scale. Embeddings don’t clean your data. They
faithfully encode whatever mess you hand them.</p>

<p><strong>You still need governance and ownership.</strong> Who owns this dataset, what is it allowed
to mean, is it fresh — these questions don’t soften when an AI starts consuming the
data. They sharpen, because the failure mode goes from a quietly wrong dashboard to a
confidently wrong answer served to every user who asks.</p>

<p><strong>You still need defined meaning.</strong> An LLM querying your warehouse with no single
governed definition of “revenue” will invent its own, repeatedly and inconsistently.
The <a href="/essays/what-is-a-semantic-layer/">semantic layer</a> doesn’t become obsolete in the
AI era; it becomes the guardrail that makes the AI era survivable.</p>

<blockquote>
  <p>GenAI adds a vector index to the stack. It does not subtract the need for structure,
quality, governance, or meaning. The new thing is additive; the old things are
load-bearing.</p>
</blockquote>

<h2 id="the-pattern-to-be-skeptical-of">The pattern to be skeptical of</h2>

<p>Here’s where the hype does its damage. A great deal of “AI-native data platform”
messaging is, underneath, selling you a vector database and a retrieval pipeline while
implying you can now skip the boring disciplines — that the model is smart enough to
paper over messy, ungoverned, ambiguous data. This is the same move every
hype cycle makes, and it fails the same way. As with the
<a href="/essays/the-medallion-architecture-reconsidered/">medallion architecture</a>, the
trouble starts when a useful <em>addition</em> gets sold as a <em>replacement</em> for the
fundamentals it actually depends on. A vector index on top of a swamp is just a
faster way to retrieve swamp.</p>

<p>The tell is always the same: any pitch that lets you defer data quality, ownership, or
definition because “the AI handles it” is selling you a way to scale your existing
problems, not solve them.</p>

<h2 id="the-fundamentals-are-a-moat-not-a-relic">The fundamentals are a moat, not a relic</h2>

<p>So treat GenAI the way you’d treat any genuine-but-overhyped advance. Adopt the real
additions — vector storage where you need semantic retrieval, the means to bring
unstructured data into scope, RAG where it fits. And refuse the implied permission to
neglect everything else, because everything else is precisely what determines whether
the AI produces signal or confident noise.</p>

<p>The reassuring part, if you’ve spent years on the unglamorous work, is this: the
fundamentals you’ve been told are about to be made obsolete are in fact the thing that
separates teams who get value from AI from teams who get a very fast nonsense machine.
<a href="/essays/the-shape-of-data/">Deliberate structure</a>, clean data, clear ownership,
defined meaning — these aren’t a relic the new era renders quaint. They’re the moat.
AI just raised their value. The everything-changer, it turns out, mostly changed the
stakes of getting the basics right.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[Cutting through the hype: GenAI adds vector storage and new retrieval patterns to the data stack, but the fundamentals — structure, quality, governance, ownership — matter more than ever, not less.]]></summary></entry><entry><title type="html">Slowly Changing Dimensions, Explained Without the Jargon</title><link href="https://dataarchitect.studio/essays/slowly-changing-dimensions-explained/" rel="alternate" type="text/html" title="Slowly Changing Dimensions, Explained Without the Jargon" /><published>2026-05-31T00:00:00+05:30</published><updated>2026-05-31T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/slowly-changing-dimensions-explained</id><content type="html" xml:base="https://dataarchitect.studio/essays/slowly-changing-dimensions-explained/"><![CDATA[<p>“Slowly changing dimensions” is an intimidating name for a simple idea. A dimension
describes context — a customer, a product, a store. Over time, that context changes:
the customer moves city, the product gets recategorised, the store switches region.
A slowly changing dimension, or SCD, is just a <em>strategy for what happens to history
when one of those attributes changes.</em> That’s the entire concept. The types — 1, 2,
3 — are only different answers to one question: do you keep the old value, or
overwrite it?</p>

<p>If you’ve read the <a href="/essays/a-field-guide-to-dimensional-modeling/">field guide to dimensional
modeling</a>, you’ve met this idea in
passing. Here’s the full version.</p>

<h2 id="the-question-every-dimension-eventually-asks">The question every dimension eventually asks</h2>

<p>Say you have a customer dimension, and the customer <code class="language-plaintext highlighter-rouge">C-1077</code> was in the “Small
Business” segment when they placed an order last year. This year they’ve grown and
been moved to “Enterprise.” Someone now runs a report: <em>sales by segment, last
year.</em> Should <code class="language-plaintext highlighter-rouge">C-1077</code>’s old orders count under “Small Business” (what they were at
the time) or “Enterprise” (what they are now)?</p>

<p>There is no universally correct answer — only a business decision. SCD types are the
menu of ways to implement whichever answer your business needs.</p>

<blockquote>
  <p>An SCD type is a technical setting that encodes a business choice: when reality
changes, does our history change with it, or stay as it was?</p>
</blockquote>

<h2 id="type-1--overwrite-forget-the-past">Type 1 — Overwrite (forget the past)</h2>

<p>The simplest strategy: when an attribute changes, you just update the value in
place. The old value is gone.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">UPDATE</span> <span class="n">dim_customer</span>
   <span class="k">SET</span> <span class="n">segment</span> <span class="o">=</span> <span class="s1">'Enterprise'</span>
 <span class="k">WHERE</span> <span class="n">customer_id</span> <span class="o">=</span> <span class="s1">'C-1077'</span><span class="p">;</span>
</code></pre></div></div>

<p>Now <em>all</em> of <code class="language-plaintext highlighter-rouge">C-1077</code>’s orders — past and future — report under “Enterprise,”
because that’s the only value the dimension remembers. Last year’s “sales by
segment” will retroactively change.</p>

<p><strong>Use Type 1 when history doesn’t matter for that attribute</strong> — correcting a
misspelled name, fixing a data-entry error, or tracking a value where only the
current state is ever meaningful. It’s cheap and easy. The cost is amnesia: you can
never again ask what things looked like before.</p>

<h2 id="type-2--add-a-new-row-keep-full-history">Type 2 — Add a new row (keep full history)</h2>

<p>This is the important one, the strategy people usually mean when they say “SCD.” On
change, you don’t overwrite — you <em>expire the old row and insert a new one</em>. The
customer now exists as two rows: the historical version and the current version,
each valid for a span of time.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dim_customer</span>
<span class="o">+</span><span class="c1">-------------+-------------+------------+------------+------------+------------+</span>
<span class="o">|</span> <span class="n">customer_key</span><span class="o">|</span> <span class="n">customer_id</span> <span class="o">|</span> <span class="n">segment</span>    <span class="o">|</span> <span class="n">valid_from</span> <span class="o">|</span> <span class="n">valid_to</span>   <span class="o">|</span> <span class="n">is_current</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-------------+-------------+------------+------------+------------+------------+</span>
<span class="o">|</span> <span class="mi">4401</span>        <span class="o">|</span> <span class="k">C</span><span class="o">-</span><span class="mi">1077</span>      <span class="o">|</span> <span class="n">Small</span> <span class="n">Bus</span><span class="p">.</span> <span class="o">|</span> <span class="mi">2024</span><span class="o">-</span><span class="mi">02</span><span class="o">-</span><span class="mi">01</span> <span class="o">|</span> <span class="mi">2026</span><span class="o">-</span><span class="mi">03</span><span class="o">-</span><span class="mi">14</span> <span class="o">|</span> <span class="k">false</span>      <span class="o">|</span>
<span class="o">|</span> <span class="mi">8902</span>        <span class="o">|</span> <span class="k">C</span><span class="o">-</span><span class="mi">1077</span>      <span class="o">|</span> <span class="n">Enterprise</span> <span class="o">|</span> <span class="mi">2026</span><span class="o">-</span><span class="mi">03</span><span class="o">-</span><span class="mi">15</span> <span class="o">|</span> <span class="mi">9999</span><span class="o">-</span><span class="mi">12</span><span class="o">-</span><span class="mi">31</span> <span class="o">|</span> <span class="k">true</span>       <span class="o">|</span>
<span class="o">+</span><span class="c1">-------------+-------------+------------+------------+------------+------------+</span>
</code></pre></div></div>

<p>Each fact (each order) points to the <code class="language-plaintext highlighter-rouge">customer_key</code> that was current <em>when the order
happened.</em> Last year’s orders reference key <code class="language-plaintext highlighter-rouge">4401</code> (“Small Business”); this year’s
reference <code class="language-plaintext highlighter-rouge">8902</code> (“Enterprise”). History stays honest — “sales by segment, last
year” returns what was actually true then, and it never changes.</p>

<p>Two things make Type 2 work, and both are worth noticing. First, the validity
columns (<code class="language-plaintext highlighter-rouge">valid_from</code>, <code class="language-plaintext highlighter-rouge">valid_to</code>, <code class="language-plaintext highlighter-rouge">is_current</code>) are what let two versions of “the
same” customer coexist and be queried by date. Second — and this is why I keep
linking them — Type 2 is <em>only possible</em> because of a <a href="/essays/surrogate-keys-vs-natural-keys/">surrogate
key</a>. The natural key <code class="language-plaintext highlighter-rouge">C-1077</code> is identical
on both rows; it’s the meaningless surrogate (<code class="language-plaintext highlighter-rouge">4401</code> vs <code class="language-plaintext highlighter-rouge">8902</code>) that distinguishes
the versions and gives facts something stable to point at. Without a surrogate key,
a fact row has no single column that says which version of the customer it belongs to.</p>

<p>The cost of Type 2 is that the dimension grows over time and your pipeline gets more
complex — it has to detect changes, expire old rows, and insert new ones correctly.
For most analytics where history matters, that complexity is worth paying.</p>
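<p>The detect, expire, and insert steps can be sketched in a few lines. This is a minimal illustration using Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code>, not a production loader: the helper names <code class="language-plaintext highlighter-rouge">apply_scd2</code> and <code class="language-plaintext highlighter-rouge">segment_on</code> are made up for the example, dates are stored as ISO strings, and the surrogate key is whatever SQLite assigns next rather than the <code class="language-plaintext highlighter-rouge">8902</code> shown in the table above.</p>

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, assigned by SQLite
        customer_id  TEXT NOT NULL,        -- natural key, repeated across versions
        segment      TEXT NOT NULL,
        valid_from   TEXT NOT NULL,
        valid_to     TEXT NOT NULL,
        is_current   INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO dim_customer VALUES "
    "(4401, 'C-1077', 'Small Business', '2024-02-01', '9999-12-31', 1)"
)


def apply_scd2(conn, customer_id, new_segment, change_date):
    """Expire the current row and insert a new version, but only on a real change."""
    row = conn.execute(
        "SELECT customer_key, segment FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row is not None and row[1] == new_segment:
        return  # no change detected, so re-running is harmless
    day_before = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    with conn:  # expire + insert commit together or not at all
        if row is not None:
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_key = ?",
                (day_before, row[0]),
            )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, segment, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, new_segment, change_date),
        )


def segment_on(conn, customer_id, as_of):
    """Look up the segment that was valid on a given ISO date."""
    return conn.execute(
        "SELECT segment FROM dim_customer "
        "WHERE customer_id = ? AND ? BETWEEN valid_from AND valid_to",
        (customer_id, as_of),
    ).fetchone()[0]


apply_scd2(conn, "C-1077", "Enterprise", "2026-03-15")
apply_scd2(conn, "C-1077", "Enterprise", "2026-03-15")  # second call is a no-op

print(segment_on(conn, "C-1077", "2025-06-01"))
print(segment_on(conn, "C-1077", "2026-04-01"))
```

<p>The early return on an unchanged value is a design choice worth keeping: without it, every pipeline run would mint a new version row even when nothing about the customer changed.</p>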

<h2 id="type-3--add-a-column-keep-limited-history">Type 3 — Add a column (keep limited history)</h2>

<p>A middle option, used far less often. Instead of a new row, you keep both the old
and new value side by side in <em>columns</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>customer_id | current_segment | previous_segment | segment_changed_on
C-1077      | Enterprise      | Small Business   | 2026-03-15
</code></pre></div></div>

<p>This preserves exactly <em>one</em> step of history — you can see the current and the
immediately prior value, but nothing further back. Type 3 fits the narrow case where
you care about “before and after” a single known transition (a re-org, a
re-branding) and don’t need full history. It’s a specialist tool; reach for Type 2
when you want history in general, and Type 3 only for this specific shape of
question.</p>
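<p>A short sketch makes the “exactly one step” limit visible. The table mirrors the columns shown above; the helper name <code class="language-plaintext highlighter-rouge">apply_scd3</code> and the second change to a “Strategic” segment are invented for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id        TEXT PRIMARY KEY,
        current_segment    TEXT NOT NULL,
        previous_segment   TEXT,
        segment_changed_on TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES ('C-1077', 'Small Business', NULL, NULL)")


def apply_scd3(conn, customer_id, new_segment, changed_on):
    """Shift the current value into previous_segment, then store the new value."""
    with conn:
        conn.execute(
            # The right-hand sides read the row as it was before the update,
            # so previous_segment receives the old current_segment.
            "UPDATE dim_customer "
            "SET previous_segment = current_segment, "
            "    current_segment = ?, "
            "    segment_changed_on = ? "
            "WHERE customer_id = ? AND current_segment <> ?",
            (new_segment, changed_on, customer_id, new_segment),
        )


def snapshot(conn, customer_id):
    return conn.execute(
        "SELECT * FROM dim_customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()


apply_scd3(conn, "C-1077", "Enterprise", "2026-03-15")
after_first = snapshot(conn, "C-1077")

# Hypothetical second change: the original 'Small Business' value is now lost.
apply_scd3(conn, "C-1077", "Strategic", "2027-01-01")
after_second = snapshot(conn, "C-1077")

print(after_first)
print(after_second)
```

<p>After the second change there is no trace of “Small Business” anywhere in the row, which is the practical reason to prefer Type 2 whenever more than one transition might matter.</p>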

<h2 id="which-to-use">Which to use</h2>

<p>The choice is per-attribute, not per-table — a single dimension can mix strategies.
A customer’s <code class="language-plaintext highlighter-rouge">segment</code> might be Type 2 (you want historical accuracy for reporting),
while a typo correction in their <code class="language-plaintext highlighter-rouge">name</code> is Type 1 (no one wants to preserve the
misspelling). Decide attribute by attribute:</p>

<ul>
  <li><strong>Type 1</strong> — you only ever care about the current value, or you’re fixing an error.</li>
  <li><strong>Type 2</strong> — you need to report on history as it actually was. This is the
workhorse; when in doubt, this is usually the right default for attributes that
carry analytical meaning.</li>
  <li><strong>Type 3</strong> — you need exactly one prior value around a specific, known change.</li>
</ul>

<h2 id="the-trap-to-avoid">The trap to avoid</h2>

<p>The classic mistake isn’t choosing the wrong type — it’s <em>not choosing at all.</em>
Teams default to silent Type 1 (overwriting) simply because that’s what a plain
<code class="language-plaintext highlighter-rouge">UPDATE</code> does, and only discover the problem months later when someone asks for a
historical breakdown and finds the past has been quietly rewritten. By then the old
values are gone and unrecoverable.</p>

<p>So make the decision deliberately, attribute by attribute, before the data starts
flowing. That act — deciding what history means <em>before</em> you need it — is exactly
the kind of small, deliberate choice that separates a designed dimension from one
that merely accumulated. Slowly changing dimensions aren’t arcane. They’re just the
place where you decide, on purpose, what your data is allowed to forget.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[Slowly changing dimensions answer one question: when a dimension attribute changes, do you overwrite history or preserve it? Here are SCD Types 1, 2, and 3, and exactly when to use each.]]></summary></entry><entry><title type="html">Data Warehouse vs Data Lake vs Lakehouse: A Clear Comparison</title><link href="https://dataarchitect.studio/essays/data-warehouse-vs-data-lake-vs-lakehouse/" rel="alternate" type="text/html" title="Data Warehouse vs Data Lake vs Lakehouse: A Clear Comparison" /><published>2026-05-30T00:00:00+05:30</published><updated>2026-05-30T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/data-warehouse-vs-data-lake-vs-lakehouse</id><content type="html" xml:base="https://dataarchitect.studio/essays/data-warehouse-vs-data-lake-vs-lakehouse/"><![CDATA[<p>Three terms get used almost interchangeably and mean genuinely different things: the
<strong>data warehouse</strong>, the <strong>data lake</strong>, and the <strong>lakehouse</strong>. The confusion is
understandable, because all three are “places you put data for analytics.” But they
make opposite bets about structure, cost, and trust, and choosing well means
understanding the bet each one makes. Here’s the clear version.</p>

<h2 id="data-warehouse-structure-first">Data warehouse: structure first</h2>

<p>A data warehouse stores <strong>structured, modeled data</strong>, optimised for analytical
queries. Before data lands in a warehouse, it’s cleaned, shaped, and fitted to a
schema — an approach called <em>schema-on-write</em>, because the structure is enforced at
the moment you write the data in.</p>

<p>This is the world of <a href="/essays/a-field-guide-to-dimensional-modeling/">dimensional
models</a>,
<a href="/essays/star-schema-vs-snowflake-schema/">star schemas</a>, and curated tables. The
warehouse’s whole value proposition is <strong>trust and speed for analytics</strong>: because
everything is modeled and typed up front, queries are fast, results are consistent,
and a business analyst can point a BI tool at it and get reliable answers without
thinking about plumbing.</p>

<p>The costs are real, though. Schema-on-write means <strong>upfront modeling work</strong> before
data is usable, and rigidity afterward — adding a new data source or changing shape
takes deliberate effort. Warehouses also traditionally <strong>couple storage and
compute</strong> and charge accordingly, which gets expensive at scale and makes them a
poor home for huge volumes of raw or semi-structured data (logs, images, free text)
that don’t fit neatly into columns.</p>

<h2 id="data-lake-flexibility-first">Data lake: flexibility first</h2>

<p>A data lake makes the opposite bet. It stores <strong>raw data of any shape</strong> — structured,
semi-structured, unstructured — as files in cheap object storage (S3, GCS, Azure
Blob). There’s no schema required to <em>write</em>; you impose structure later, when you
<em>read</em>, an approach called <em>schema-on-read</em>.</p>
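<p>The difference between the two approaches is mostly a question of <em>when</em> bad records get caught. The sketch below is a toy contrast under stated assumptions: the two-field schema and the sample records are invented, and plain Python lists stand in for a warehouse table and a lake’s file store.</p>

```python
import json

# Hypothetical schema for an orders feed.
SCHEMA = {"order_id": int, "amount": float}


def validate(record):
    """Raise ValueError unless every schema field is present with the right type."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field!r} must be {expected.__name__}")
    return record


raw_lines = [
    '{"order_id": 1, "amount": 19.5}',
    '{"order_id": "two", "amount": "n/a"}',
]

# Schema-on-write: structure is enforced at load time, so bad rows never land.
warehouse, rejected = [], []
for line in raw_lines:
    try:
        warehouse.append(validate(json.loads(line)))
    except ValueError:
        rejected.append(line)

# Schema-on-read: everything lands as-is; each reader imposes structure later.
lake = list(raw_lines)


def read_amounts(lines):
    """One reader's interpretation of the raw files; other readers may differ."""
    amounts = []
    for line in lines:
        try:
            amounts.append(float(json.loads(line)["amount"]))
        except (KeyError, TypeError, ValueError):
            continue  # this reader silently skips what it cannot parse
    return amounts


print(len(warehouse), len(rejected), len(lake))
print(read_amounts(lake))
```

<p>Both paths end up with the same usable number here, but the lake kept the malformed record and left the decision about it to whoever reads next.</p>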

<p>This buys two things warehouses struggle with. First, <strong>cost</strong>: object storage is
cheap, so you can keep enormous volumes of data affordably. Second, <strong>flexibility</strong>:
you can dump data in now and decide what to do with it later, which suits data
science, machine learning, and any workload that wants raw, untransformed inputs.
The lake is also the natural home for <a href="/essays/the-medallion-architecture-reconsidered/">immutable raw history you can always reprocess
from</a>.</p>

<p>But flexibility has a failure mode, and it has a name: the <strong>data swamp</strong>. With no
enforced schema, no guaranteed quality, and often no clear ownership, a lake can
degrade into a vast pile of files nobody trusts or understands. There’s also no
built-in transactional guarantee — no clean notion of “this table is in a consistent
state right now” — which makes lakes hard to use for the reliable, concurrent
analytics that warehouses do effortlessly. A lake gives you everything and promises
nothing.</p>

<h2 id="lakehouse-an-attempt-to-have-both">Lakehouse: an attempt to have both</h2>

<p>The lakehouse is the newer idea, and it’s exactly what the name suggests: an attempt
to get <strong>warehouse-like reliability on data-lake economics.</strong> You keep your data in
cheap object storage as files — the lake part — but add a <strong>metadata and table layer
on top</strong> that brings the structure and guarantees a warehouse has.</p>

<blockquote>
  <p>The lakehouse bet: keep the cheap, flexible storage of a lake, but add a layer that
gives it the schema, transactions, and trust of a warehouse — so you don’t have to
run and sync both.</p>
</blockquote>

<p>That added layer is delivered by <strong>open table formats</strong> — Apache Iceberg, Delta Lake,
Apache Hudi — which sit over the raw files and provide the things lakes lacked: ACID
transactions (consistent reads and writes), schema enforcement and evolution,
time-travel to past versions, and performance optimisations. The result is a single
system where you can run both the flexible, ML-style workloads of a lake <em>and</em> the
reliable, structured BI of a warehouse, without maintaining two separate platforms
and a brittle pipeline copying between them.</p>

<p>The trade-off is <strong>maturity and complexity.</strong> The lakehouse stack is younger and has
more moving parts than a turnkey warehouse; you’re assembling storage, a table
format, a query engine, and a catalog rather than buying one integrated product. For
some teams that flexibility is the point; for others it’s overhead they don’t need.</p>

<h2 id="how-to-choose">How to choose</h2>

<p>Strip away the marketing and it comes down to your workload:</p>

<ul>
  <li><strong>Choose a warehouse</strong> when your work is overwhelmingly structured analytics and BI,
you value simplicity and reliability over flexibility, and your data volumes are
manageable. For a team that mostly builds dashboards and reports, a warehouse is
still the simplest, most dependable answer — don’t over-engineer past it.</li>
  <li><strong>Choose a lake</strong> when you have large volumes of raw, varied, or unstructured data,
heavy data-science or ML needs, and the engineering discipline to stop it becoming
a swamp. A lake is rarely the whole answer on its own anymore.</li>
  <li><strong>Choose a lakehouse</strong> when you genuinely need both — structured BI <em>and</em> raw/ML
workloads on the same large datasets — and want to avoid running a separate lake and
warehouse with a sync pipeline between them. This is where many teams are
converging, but adopt it because you have both needs, not because it’s the newest
word.</li>
</ul>

<h2 id="the-part-none-of-them-solve">The part none of them solve</h2>

<p>One caution worth ending on. Whichever you pick, it answers <em>where data is stored and
how it’s structured</em> — a physical question. None of the three tells you what your
data <em>means</em>: which definition of “revenue” is canonical, who owns the customer
table, how “active user” is defined. That’s the job of a <a href="/essays/what-is-a-semantic-layer/">semantic
layer</a>, which sits <em>above</em> all three. Teams often
expect a shiny new lakehouse to fix their consistency problems and are surprised when
the same three-different-numbers arguments continue — because storage was never the
thing causing them. Choose the right store for your workload. Then govern the meaning
on top of it. They’re different problems, and you need answers to both.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[A data warehouse stores structured, modeled data for analytics. A data lake stores raw data of any shape, cheaply. A lakehouse tries to be both. Here's the real trade-off and how to choose.]]></summary></entry><entry><title type="html">How to Make a Data Pipeline Idempotent</title><link href="https://dataarchitect.studio/essays/how-to-make-a-data-pipeline-idempotent/" rel="alternate" type="text/html" title="How to Make a Data Pipeline Idempotent" /><published>2026-05-29T00:00:00+05:30</published><updated>2026-05-29T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/how-to-make-a-data-pipeline-idempotent</id><content type="html" xml:base="https://dataarchitect.studio/essays/how-to-make-a-data-pipeline-idempotent/"><![CDATA[<p>A data pipeline that can’t be safely re-run is a liability waiting for a bad night.
Jobs fail halfway. Schedulers retry. Someone kicks off a backfill over a range that
already partly ran. If any of those scenarios can corrupt your data — duplicate
rows, double-counted revenue, inconsistent state — you don’t have a pipeline, you
have a trap. The property that defuses all of it is <strong>idempotency</strong>, and building
it in is more about a few disciplined patterns than about clever code.</p>

<h2 id="what-idempotency-actually-means">What idempotency actually means</h2>

<p>An operation is idempotent if running it many times produces the same result as
running it once. Press a floor button in an elevator five times; the elevator still
goes to the floor once. For a pipeline, it means: <em>re-running the same job with the
same input leaves the data in exactly the same correct state, no matter how many
times it executes.</em></p>

<p>This is not the same as “exactly-once processing,” which is a hard distributed-systems
guarantee about each input being handled precisely one time. Idempotency is more
achievable and, for most batch and many streaming pipelines, more useful: you stop
trying to guarantee the job runs once, and instead make it <em>safe to run any number
of times.</em> Retries become boring. Backfills become safe. That’s the whole prize.</p>

<blockquote>
  <p>Don’t try to guarantee your job runs exactly once. Make it not matter how many
times it runs.</p>
</blockquote>

<h2 id="the-anti-pattern-to-eliminate-first">The anti-pattern to eliminate first</h2>

<p>The single most common idempotency killer is the <strong>blind append</strong>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">fact_sales</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">staging_sales</span> <span class="k">WHERE</span> <span class="n">sale_date</span> <span class="o">=</span> <span class="s1">'2026-05-29'</span><span class="p">;</span>
</code></pre></div></div>

<p>Run this twice and you have every row twice. The job has no notion of “I already
did this date” — it just appends. Every pattern below is, at heart, a way to replace
blind appends with operations that can absorb a re-run without duplicating.</p>
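<p>To see the failure concretely, here is a minimal sketch that runs the blind append twice against an in-memory SQLite database. The <code class="language-plaintext highlighter-rouge">sale_id</code> and <code class="language-plaintext highlighter-rouge">amount</code> columns and the sample rows are assumptions made for illustration; only the table names come from the query above.</p>

```python
import sqlite3

# In-memory stand-in for a warehouse; the column layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?, ?)",
    [(1, "2026-05-29", 10.0), (2, "2026-05-29", 25.0)],
)
conn.commit()


def blind_append(day: str) -> None:
    """The anti-pattern: append the day's rows without checking what is already loaded."""
    with conn:
        conn.execute(
            "INSERT INTO fact_sales SELECT * FROM staging_sales WHERE sale_date = ?",
            (day,),
        )


def row_count() -> int:
    return conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]


blind_append("2026-05-29")
count_after_first_run = row_count()

blind_append("2026-05-29")  # simulated scheduler retry
count_after_retry = row_count()
```

<p>The second call stands in for a scheduler retry: nothing in the job stops it from loading the same two rows again, so the fact table ends up with four.</p>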

<h2 id="pattern-1-overwrite-by-partition">Pattern 1: Overwrite by partition</h2>

<p>The simplest and most robust pattern: make your unit of work a <strong>partition</strong> (most
often a date), and have the job <em>replace</em> that partition rather than add to it.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">fact_sales</span> <span class="k">WHERE</span> <span class="n">sale_date</span> <span class="o">=</span> <span class="s1">'2026-05-29'</span><span class="p">;</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">fact_sales</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">staging_sales</span> <span class="k">WHERE</span> <span class="n">sale_date</span> <span class="o">=</span> <span class="s1">'2026-05-29'</span><span class="p">;</span>
</code></pre></div></div>

<p>Or, in a partitioned lake/warehouse, overwrite the single partition atomically
(<code class="language-plaintext highlighter-rouge">INSERT OVERWRITE</code>, <code class="language-plaintext highlighter-rouge">replaceWhere</code>, dynamic partition overwrite — the syntax varies
by engine). Now re-running the job for 2026-05-29 produces the same result every
time: it clears the day and rebuilds it. Retries and backfills over any date range
are automatically safe, because each date is rebuilt from scratch from its source.</p>

<p>This works beautifully when your data is naturally partitioned by an immutable
processing window and you can fully recompute a partition from its inputs. Make the
delete-and-load atomic, by wrapping both statements in one transaction or by using the
engine’s atomic overwrite, so a failure midway can’t leave the partition empty or
half-loaded.</p>
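<p>A minimal sketch of the delete-and-load pattern, again on in-memory SQLite with assumed columns. The hypothetical <code class="language-plaintext highlighter-rouge">fail_midway</code> flag exists only to simulate a crash between the two statements; it is not part of the pattern itself.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?, ?)",
    [(1, "2026-05-29", 10.0), (2, "2026-05-29", 25.0)],
)
conn.commit()


def overwrite_partition(day: str, fail_midway: bool = False) -> None:
    """Replace one day's rows; the delete and the load share a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM fact_sales WHERE sale_date = ?", (day,))
        if fail_midway:
            raise RuntimeError("simulated crash between delete and load")
        conn.execute(
            "INSERT INTO fact_sales SELECT * FROM staging_sales WHERE sale_date = ?",
            (day,),
        )


def snapshot() -> list:
    return conn.execute(
        "SELECT sale_id, sale_date, amount FROM fact_sales ORDER BY sale_id"
    ).fetchall()


overwrite_partition("2026-05-29")
first_run = snapshot()

overwrite_partition("2026-05-29")  # retry or backfill over the same day
second_run = snapshot()

try:
    overwrite_partition("2026-05-29", fail_midway=True)
except RuntimeError:
    pass
after_failed_run = snapshot()  # the rollback restored the deleted rows
```

<p>All three snapshots should match: the re-run rebuilds the same partition, and the failed run rolls back its delete instead of leaving the day empty.</p>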

<h2 id="pattern-2-merge-upsert-on-a-key">Pattern 2: Merge (upsert) on a key</h2>

<p>When you’re maintaining mutable state — a dimension, a “current status per entity”
table — overwriting partitions doesn’t fit. Here the tool is a <strong>merge</strong> keyed on a
stable identifier:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MERGE</span> <span class="k">INTO</span> <span class="n">dim_customer</span> <span class="k">AS</span> <span class="n">t</span>
<span class="k">USING</span> <span class="n">staging_customer</span> <span class="k">AS</span> <span class="n">s</span>
  <span class="k">ON</span> <span class="n">t</span><span class="p">.</span><span class="n">customer_id</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">customer_id</span>
<span class="k">WHEN</span> <span class="n">MATCHED</span> <span class="k">THEN</span> <span class="k">UPDATE</span> <span class="k">SET</span> <span class="p">...</span>
<span class="k">WHEN</span> <span class="k">NOT</span> <span class="n">MATCHED</span> <span class="k">THEN</span> <span class="k">INSERT</span> <span class="p">...;</span>
</code></pre></div></div>

<p>Because the merge keys on <code class="language-plaintext highlighter-rouge">customer_id</code>, running it twice with the same input is
effectively a no-op the second time: matched rows are updated to the same values, and
nothing new is inserted. That holds only if the <code class="language-plaintext highlighter-rouge">SET</code> clause writes values taken from the
input; a column set to the current time, for example, changes on every run. This is why a
<a href="/essays/surrogate-keys-vs-natural-keys/">stable surrogate or business
key</a> matters so much: idempotent upserts
are only possible if every row has a reliable identity to match on. No key, no safe
merge.</p>
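<p>A runnable sketch of the same idea. SQLite has no <code class="language-plaintext highlighter-rouge">MERGE</code> statement, so this swaps in its <code class="language-plaintext highlighter-rouge">INSERT ... ON CONFLICT DO UPDATE</code> upsert, which matches on the primary key in the same way. The <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">tier</code> columns and the sample customers are assumptions.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, tier TEXT)"
)
conn.execute("CREATE TABLE staging_customer (customer_id INTEGER, name TEXT, tier TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Asha', 'silver')")
conn.executemany(
    "INSERT INTO staging_customer VALUES (?, ?, ?)",
    [(1, "Asha", "gold"), (2, "Bram", "silver")],
)
conn.commit()


def upsert_customers() -> None:
    """Update matched customers and insert new ones, keyed on customer_id."""
    with conn:
        conn.execute(
            """
            INSERT INTO dim_customer (customer_id, name, tier)
            SELECT customer_id, name, tier FROM staging_customer
            WHERE true  -- SQLite needs a WHERE clause here to parse the upsert
            ON CONFLICT(customer_id) DO UPDATE SET
                name = excluded.name,
                tier = excluded.tier
            """
        )


def snapshot() -> list:
    return conn.execute(
        "SELECT customer_id, name, tier FROM dim_customer ORDER BY customer_id"
    ).fetchall()


upsert_customers()
after_first_run = snapshot()

upsert_customers()  # same staging input, run again
after_second_run = snapshot()
```

<p>The first run upgrades customer 1 and inserts customer 2; the second run finds both keys already present and rewrites them with identical values, so the table does not change.</p>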

<h2 id="pattern-3-deterministic-output-keys">Pattern 3: Deterministic output keys</h2>

<p>For systems without merge, you can lean on the storage layer’s uniqueness. Compute a
<strong>deterministic primary key</strong> for each output row from its inputs — say, a hash of
the natural key plus the event timestamp — and write with insert-or-ignore / upsert
semantics. Because the same input always produces the same key, a re-run collides
with the rows it wrote last time instead of duplicating them. The key <em>is</em> the
idempotency.</p>
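<p>As a rough sketch, the key can be a SHA-256 hash over the natural key and the event timestamp, with the table’s primary key doing the de-duplication. The <code class="language-plaintext highlighter-rouge">events_out</code> table, its columns, and the <code class="language-plaintext highlighter-rouge">output_key</code> helper are hypothetical names chosen for this example.</p>

```python
import hashlib
import sqlite3


def output_key(natural_key: str, event_ts: str) -> str:
    """Derive a stable key from the inputs; the same inputs always hash the same."""
    raw = f"{natural_key}|{event_ts}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events_out ("
    "event_key TEXT PRIMARY KEY, customer_id TEXT, event_ts TEXT, amount REAL)"
)

batch = [
    ("C-17", "2026-05-29T10:00:00Z", 40.0),
    ("C-42", "2026-05-29T10:05:00Z", 15.5),
]


def load(rows: list) -> None:
    """Insert-or-ignore: a row whose key already exists is skipped, not duplicated."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO events_out VALUES (?, ?, ?, ?)",
            [(output_key(cid, ts), cid, ts, amt) for cid, ts, amt in rows],
        )


load(batch)
load(batch)  # re-run with the same input collides on event_key
loaded_rows = conn.execute("SELECT COUNT(*) FROM events_out").fetchone()[0]
```

<p>Insert-or-ignore keeps the first version of a row. If a re-run may carry corrected values for the same key, upsert semantics are the better fit.</p>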

<h2 id="the-rule-for-side-effects">The rule for side effects</h2>

<p>Transformations on data inside your warehouse are the easy case. The dangerous case
is <strong>external side effects</strong> — sending an email, posting to an API, incrementing a
counter, dropping a message on a queue. These are rarely idempotent by default:
retry the step and you send the email twice.</p>

<p>The fixes are the same family of ideas applied outward: attach an <strong>idempotency key</strong>
to each external action and have the receiving system de-duplicate on it (most
serious payment and messaging APIs support exactly this); or <strong>record what you’ve
already done</strong> in a state table and check it before acting. The principle holds —
make the <em>effect</em> of running twice identical to running once.</p>
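<p>The “record what you’ve already done” variant might look like the sketch below. The <code class="language-plaintext highlighter-rouge">sent_actions</code> table, the <code class="language-plaintext highlighter-rouge">send_once</code> helper, and the in-memory <code class="language-plaintext highlighter-rouge">outbox</code> list (a stand-in for a real email or API client) are all assumptions for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sent_actions (idempotency_key TEXT PRIMARY KEY)")

outbox = []  # stand-in for a real email or API client


def send_once(idempotency_key: str, recipient: str, body: str) -> bool:
    """Perform the side effect only if this key has not been recorded yet.

    Returns True when the action ran, False when it was skipped as a repeat.
    """
    with conn:
        claimed = conn.execute(
            "INSERT OR IGNORE INTO sent_actions VALUES (?)", (idempotency_key,)
        )
        if claimed.rowcount == 0:
            return False  # key already present: an earlier run did this
        # If the send raises, the transaction rolls back and the key is released.
        outbox.append((recipient, body))
    return True


first_attempt = send_once("invoice-1001-reminder", "a@example.com", "Invoice due")
retry_attempt = send_once("invoice-1001-reminder", "a@example.com", "Invoice due")
```

<p>This narrows the window for duplicates rather than closing it: if the send succeeds but the commit then fails, a retry would send again. That is why de-duplication on the receiving side is the stronger guarantee when it is available.</p>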

<h2 id="a-checklist">A checklist</h2>

<p>Before you call a pipeline safe to re-run, confirm:</p>

<ul>
  <li>Re-running the same job for the same window produces identical output — verify it,
don’t assume it.</li>
  <li>There are no blind <code class="language-plaintext highlighter-rouge">INSERT</code>s; every write is a partition overwrite, a merge, or an insert keyed on a deterministic key.</li>
  <li>Mutable tables are merged on a stable key.</li>
  <li>Every external side effect carries an idempotency key or a “have I done this?” check.</li>
  <li>A job that fails mid-run leaves no partial, half-correct state behind.</li>
</ul>
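<p>The first item on the checklist can be automated. One possible shape, assuming an in-memory SQLite fixture and a hypothetical <code class="language-plaintext highlighter-rouge">rerun_is_safe</code> helper, is to run the job twice and compare the table contents after each run:</p>

```python
import sqlite3
from typing import Callable


def new_warehouse() -> sqlite3.Connection:
    """Fresh in-memory fixture with one staged row (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
    conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
    conn.execute("INSERT INTO staging_sales VALUES (1, '2026-05-29', 10.0)")
    conn.commit()
    return conn


def blind_append(conn: sqlite3.Connection) -> None:
    with conn:
        conn.execute("INSERT INTO fact_sales SELECT * FROM staging_sales")


def overwrite_day(conn: sqlite3.Connection) -> None:
    with conn:
        conn.execute("DELETE FROM fact_sales WHERE sale_date = '2026-05-29'")
        conn.execute(
            "INSERT INTO fact_sales SELECT * FROM staging_sales "
            "WHERE sale_date = '2026-05-29'"
        )


def rerun_is_safe(job: Callable[[sqlite3.Connection], None], conn: sqlite3.Connection) -> bool:
    """Run the job twice and report whether the second run changed the output."""

    def snapshot() -> list:
        return sorted(conn.execute("SELECT * FROM fact_sales").fetchall())

    job(conn)
    first = snapshot()
    job(conn)
    return snapshot() == first


blind_append_is_safe = rerun_is_safe(blind_append, new_warehouse())
overwrite_is_safe = rerun_is_safe(overwrite_day, new_warehouse())
```

<p>A check like this fits naturally in a test suite: the blind append fails it and the partition overwrite passes.</p>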

<p>Get these right and the 3 a.m. retry stops being an incident. The scheduler can hammer
the job, an engineer can rerun last month’s backfill without flinching, and the data
lands in exactly one correct state every time — which is the entire point of building
the thing carefully in the first place.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[An idempotent data pipeline produces the same result whether it runs once or five times. Here are the concrete patterns — partition overwrite, merge on keys, delete-insert — that make retries and backfills safe.]]></summary></entry><entry><title type="html">What Is a Semantic Layer, and Why Does Your Data Stack Need One?</title><link href="https://dataarchitect.studio/essays/what-is-a-semantic-layer/" rel="alternate" type="text/html" title="What Is a Semantic Layer, and Why Does Your Data Stack Need One?" /><published>2026-05-28T00:00:00+05:30</published><updated>2026-05-28T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/what-is-a-semantic-layer</id><content type="html" xml:base="https://dataarchitect.studio/essays/what-is-a-semantic-layer/"><![CDATA[<p>Here is a problem every data team eventually has. Three dashboards show “active
users.” All three numbers are different. Each was built by a different person who
made a slightly different, undocumented choice about what “active” means — a
30-day window here, an exclusion of internal accounts there. No one is wrong,
exactly, and yet the organisation can no longer answer a simple question about
itself. A <strong>semantic layer</strong> is the architectural answer to this problem.</p>

<h2 id="a-definition">A definition</h2>

<p>A semantic layer is a single, governed place where your business concepts and
metrics are defined <em>once</em>, in terms the business uses, independent of where the
underlying data physically lives or which tool consumes it.</p>

<p>Instead of the definition of “active user” living implicitly inside a dozen
dashboard queries, it lives in one explicit place — <code class="language-plaintext highlighter-rouge">active_user = a user with at
least one session in the trailing 30 days, excluding internal accounts</code> — and every
dashboard, notebook, and API asks the semantic layer for that metric rather than
re-deriving it. Define once; consume everywhere; agree always.</p>
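<p>In miniature, and only as a sketch, “define once, consume everywhere” can be pictured as a single registry that every consumer calls. The <code class="language-plaintext highlighter-rouge">METRICS</code> dictionary, the <code class="language-plaintext highlighter-rouge">metric</code> helper, and the <code class="language-plaintext highlighter-rouge">users</code> and <code class="language-plaintext highlighter-rouge">sessions</code> tables are assumptions; a real semantic layer is a governed service, not a dictionary.</p>

```python
import sqlite3

# The one place the definition lives: at least one session in the trailing
# 30 days (as_of and the 29 days before it), excluding internal accounts.
METRICS = {
    "active_user": """
        SELECT COUNT(DISTINCT s.user_id)
        FROM sessions AS s
        JOIN users AS u ON u.user_id = s.user_id
        WHERE u.is_internal = 0
          AND s.session_date BETWEEN date(:as_of, '-29 days') AND :as_of
    """,
}


def metric(conn: sqlite3.Connection, name: str, **params: str) -> int:
    """Every consumer asks for a metric by name instead of re-deriving it."""
    return conn.execute(METRICS[name], params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, is_internal INTEGER)")
conn.execute("CREATE TABLE sessions (user_id INTEGER, session_date TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 0), (2, 0), (3, 1)])
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        (1, "2026-05-30"),  # external user, inside the window
        (2, "2026-04-01"),  # external user, outside the window
        (3, "2026-05-29"),  # internal account, excluded
    ],
)
conn.commit()

dashboard_value = metric(conn, "active_user", as_of="2026-05-31")
notebook_value = metric(conn, "active_user", as_of="2026-05-31")
```

<p>Because both callers go through the same definition, they cannot disagree; changing what “active” means is a change in one place.</p>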

<blockquote>
  <p>A semantic layer turns a metric from something each report <em>re-implements</em> into
something the organisation <em>declares</em>. The definition stops being folklore and
becomes infrastructure.</p>
</blockquote>

<h2 id="what-it-sits-between">What it sits between</h2>

<p>Picture two halves of your stack. Below is the <strong>physical</strong> half: warehouse tables,
the raw and refined layers, the actual columns and rows. Above is the <strong>consumption</strong>
half: BI dashboards, ad-hoc SQL, notebooks, reverse-ETL, the AI tool someone just
wired up to your warehouse.</p>

<p>The semantic layer sits in the middle and translates between them. It maps messy
physical reality (“revenue is <code class="language-plaintext highlighter-rouge">net_amount</code> from <code class="language-plaintext highlighter-rouge">fact_order_line</code>, minus refunds
joined from <code class="language-plaintext highlighter-rouge">fact_returns</code>, in the reporting currency”) to a clean business concept
(“revenue”) that every consumer can request by name. Change the physical
implementation underneath — refactor a table, fix a join — and consumers above
never notice, because they only ever referred to “revenue,” not to the columns.</p>

<h2 id="what-it-is-not">What it is <em>not</em></h2>

<p>Three confusions are worth clearing up, because each one leads teams to think they
have a semantic layer when they don’t.</p>

<p><strong>It is not just your BI tool.</strong> Tools like Looker, Power BI, and Tableau have <em>had</em>
semantic modeling features for years, and that’s good — but if the definitions live
only inside one BI tool, then your notebooks, your SQL users, and your other tools
don’t share them. The “active user” defined in the BI tool and the one a data
scientist computes in a notebook drift apart again. A real semantic layer is
<em>consumer-agnostic</em>: every tool, BI or not, gets the same answer.</p>

<p><strong>It is not the gold layer.</strong> This is the deeper confusion. As I argued in
<a href="/essays/the-medallion-architecture-reconsidered/">reconsidering the medallion architecture</a>,
bronze-silver-gold describes how data gets <em>cleaner</em> — physical refinement. It does
not describe what data <em>means</em>. Teams keep trying to solve semantic consistency by
building more gold tables, and it never quite works, because a pile of curated
tables with no single governed definition of “revenue” is just a tidier way to
produce three different revenue numbers. The semantic layer is the thing medallion
structurally doesn’t give you.</p>

<p><strong>It is not a data contract.</strong> <a href="/essays/data-contracts-are-a-cultural-problem/">Contracts</a>
govern the promises between a <em>producer</em> and a <em>consumer</em> of a dataset. A semantic
layer governs the <em>meaning of metrics</em> consumed downstream. Related, complementary,
not the same.</p>

<h2 id="why-its-having-a-moment">Why it’s having a moment</h2>

<p>Two forces have pushed the semantic layer from “nice idea” to “increasingly
necessary.” The first is the rise of <strong>headless BI and dedicated metrics layers</strong> —
the idea that metric definitions deserve their own governed tier, queryable by any
tool through an API, rather than being trapped inside one dashboard product.</p>

<p>The second is <strong>AI consumers</strong>. The moment you let a language model query your
warehouse, the cost of ambiguous metrics goes up sharply. A human analyst who gets
a weird number will sanity-check it; an LLM will confidently report whatever the
schema hands it. If “revenue” isn’t defined in exactly one governed place, an AI
tool will cheerfully compute its own version and present it as fact. A semantic
layer is fast becoming the guardrail that makes querying-by-AI trustworthy at all.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>If your team argues about whose number is right, you don’t have a tooling problem —
you have a <em>missing semantic layer</em>. Somewhere between your tables and your
dashboards, there should be one governed place that says what each metric means, so
that every consumer asks the same question and gets the same answer. Build that, and
the three-dashboards-three-numbers problem doesn’t get patched. It stops being
possible.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[A semantic layer is the single, governed place where business metrics are defined once — independent of any dashboard. Here's what it is, what it isn't, and why it fixes the 'three numbers for one metric' problem.]]></summary></entry><entry><title type="html">The Medallion Architecture, Reconsidered</title><link href="https://dataarchitect.studio/essays/the-medallion-architecture-reconsidered/" rel="alternate" type="text/html" title="The Medallion Architecture, Reconsidered" /><published>2026-05-27T00:00:00+05:30</published><updated>2026-05-27T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/the-medallion-architecture-reconsidered</id><content type="html" xml:base="https://dataarchitect.studio/essays/the-medallion-architecture-reconsidered/"><![CDATA[<p>The medallion architecture — bronze, silver, gold — has become the default mental
model for organising a data lakehouse, and for good reason. It’s memorable, it maps
to an intuition everyone shares, and it gives a sprawling pile of data a sense of
direction. I’ve recommended it. I still do, sometimes. But defaults have a way of
hardening into dogma, and this one has hardened more than most. It’s worth a second,
honest look.</p>

<h2 id="what-the-layers-get-right">What the layers get right</h2>

<p>Stripped to its essence, the pattern says: data should flow through stages of
increasing refinement and trust. <strong>Bronze</strong> is raw, as-ingested, untouched —
history you can always replay from. <strong>Silver</strong> is cleaned, conformed, deduplicated
— the validated, queryable middle. <strong>Gold</strong> is business-ready — the curated,
aggregated tables that feed dashboards and decisions.</p>

<p>Two things here are genuinely valuable and worth keeping no matter what you call
the layers.</p>

<p>The first is <strong>provenance</strong>. Keeping raw data immutable in bronze means you can
always answer “where did this number come from?” and, crucially, <em>reprocess</em> when
you discover a bug in your logic. Throwing away raw data to save space is one of
the few truly irreversible mistakes in this field. Bronze, as a discipline, guards
against it.</p>

<p>The second is <strong>progressive refinement</strong> as an explicit idea — the recognition that
raw data and decision-ready data are different things with different consumers, and
that the transformation between them deserves to be a deliberate, staged, testable
process rather than one heroic query.</p>

<p>Those two ideas are good architecture. Hold onto them.</p>

<h2 id="where-it-quietly-falls-apart">Where it quietly falls apart</h2>

<p>The trouble starts when the three layers stop being a <em>guideline</em> and start being a
<em>rule</em> — when every table must belong to a tier and every problem must be solved by
adding another one. Several failure modes recur.</p>

<p><strong>Gold sprawls.</strong> “Business-ready” has no natural boundary, so gold becomes a
dumping ground of one-off aggregates, each built for a single dashboard, each
subtly re-deriving metrics the others already computed. You end up with the exact
inconsistency the architecture was supposed to prevent, now wearing a gold badge.
Three tables claim to hold “monthly revenue” and all three disagree.</p>

<p><strong>Silver becomes a swamp.</strong> Without a clear contract for what “cleaned and
conformed” means, silver fills with tables that are <em>sort of</em> clean — half-modeled,
inconsistently grained, owned by no one. It’s no longer raw and not yet trustworthy,
which is the worst of both worlds: a layer everyone reads from and no one believes.</p>

<p><strong>The layers masquerade as a semantic layer.</strong> This is the deepest problem. Medallion
describes <em>physical refinement</em> — how raw bytes become clean tables. It says nothing
about <em>meaning</em> — what a “customer” is, how “active” is defined, which metric is
canonical. Teams keep trying to solve semantic problems by adding more gold tables,
and it never works, because the layers were never about semantics. A bronze→silver→
gold pipeline with no governed definitions on top is just a very tidy way to produce
inconsistent numbers.</p>

<blockquote>
  <p>Medallion answers “how clean is this data?” It does not answer “what does this
data mean, and who decides?” — and most data disasters are failures of the
second question, not the first.</p>
</blockquote>

<h2 id="a-more-honest-framing">A more honest framing</h2>

<p>I’ve stopped presenting medallion as <em>the</em> architecture and started presenting it
as what it actually is: a useful convention for <strong>refinement</strong>, which must be paired
with two things it doesn’t supply.</p>

<p>The first is <strong>ownership per dataset</strong>, not per layer. The question that matters
isn’t “which tier is this in?” but “who owns this table’s shape and meaning?” A gold
table no one owns is more dangerous than a bronze one, because people <em>trust</em> it.
Tiers are a property of data; ownership is a property of teams. Don’t confuse them.</p>

<p>The second is <strong>a semantic layer that sits across the tiers</strong> — a single governed
place where business concepts and metrics are defined once, regardless of which
physical table they’re computed from. This is the part medallion structurally
cannot give you, and it’s the part that actually determines whether your
organisation agrees on what its numbers mean.</p>

<p>So: keep bronze’s immutable provenance. Keep the discipline of staged refinement.
But hold the three-layer model loosely. It’s a sketch of <em>how data gets cleaner</em>,
not a theory of <em>how data gets meaning</em>. Treat it as the former and it serves you
well. Mistake it for the latter — as a surprising number of teams do — and you’ll
spend a year building gold tables to solve a problem that was never about gold at
all.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[Bronze, silver, gold is a useful default and a dangerous dogma. A second look at what the layers get right, and where they quietly fall apart.]]></summary></entry><entry><title type="html">OLTP vs OLAP: Why You Shouldn’t Run Analytics on Your App Database</title><link href="https://dataarchitect.studio/essays/oltp-vs-olap/" rel="alternate" type="text/html" title="OLTP vs OLAP: Why You Shouldn’t Run Analytics on Your App Database" /><published>2026-05-23T00:00:00+05:30</published><updated>2026-05-23T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/oltp-vs-olap</id><content type="html" xml:base="https://dataarchitect.studio/essays/oltp-vs-olap/"><![CDATA[<p>Every data architecture eventually runs into the difference between OLTP and OLAP,
usually the hard way: someone runs a big analytical query against the production
application database, and the app slows to a crawl for everyone. The two acronyms
describe two kinds of database workload that are optimised for <em>opposite</em> things, and
understanding why is the foundation under nearly every decision about where analytics
should live.</p>

<h2 id="two-opposite-jobs">Two opposite jobs</h2>

<p><strong>OLTP</strong> — Online Transaction Processing — is the workload your <em>application</em> runs. It’s
defined by <strong>many small, fast operations</strong>: a user places an order, updates their
profile, adds an item to a cart. Thousands of these happen concurrently, each
touching a handful of rows, mixing reads and writes. The database behind your app —
Postgres, MySQL, SQL Server — is an OLTP system, and it’s tuned to do this well:
insert a row, update a record, look up a single customer by ID, all in milliseconds.</p>

<p><strong>OLAP</strong> — Online Analytical Processing — is the workload your <em>analytics</em> runs. It’s
defined by <strong>few large, complex operations</strong>: “total revenue by product category by
month for the last three years.” A single such query might scan millions or billions
of rows, aggregate them, and join across large tables. There are far fewer of these
queries, they’re almost entirely reads, and each one is enormous compared to an OLTP
operation.</p>

<blockquote>
  <p>OLTP is many people each touching a little data. OLAP is a few people each touching
a lot of data. Almost every difference between the two follows from that.</p>
</blockquote>

<h2 id="why-the-same-database-cant-do-both-well">Why the same database can’t do both well</h2>

<p>The two workloads don’t just differ in size — they pull the underlying design in
incompatible directions. Three differences matter most.</p>

<p><strong>Row storage vs column storage.</strong> OLTP databases store data <em>by row</em>, because a
transaction typically wants a whole record at once (give me everything about this
order). OLAP systems store data <em>by column</em>, because an analytical query typically
wants one or two columns across a vast number of rows (give me the <code class="language-plaintext highlighter-rouge">amount</code> column for
ten million orders, and sum it). Reading a single column from row-stored data means
touching every row to extract one field — slow. Column storage reads just that
column — fast. The storage layout that’s right for one workload is actively wrong for
the other.</p>
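<p>The layout difference can be pictured with plain Python structures. This is an illustration of the access pattern, not a benchmark, and the sample orders are invented.</p>

```python
# Row layout: each record is stored together, as an OLTP engine would keep it.
rows = [
    {"order_id": 1, "customer_id": 10, "status": "paid", "amount": 20.0},
    {"order_id": 2, "customer_id": 11, "status": "paid", "amount": 35.5},
    {"order_id": 3, "customer_id": 10, "status": "refunded", "amount": 12.25},
]

# Column layout: each field is stored together, as a columnar OLAP engine would keep it.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def total_from_rows(records: list) -> float:
    """Visits every whole record just to pick out one field."""
    return sum(record["amount"] for record in records)


def total_from_columns(table: dict) -> float:
    """Reads only the one column the query needs."""
    return sum(table["amount"])


def order_from_rows(records: list, order_id: int) -> dict:
    """Fetching a whole order is a single lookup in the row layout."""
    return next(record for record in records if record["order_id"] == order_id)


row_total = total_from_rows(rows)
column_total = total_from_columns(columns)
```

<p>Both totals agree; what differs is how much unrelated data each layout has to walk past to produce them. At ten million orders, that difference is the whole story.</p>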

<p><strong>Normalized vs denormalized.</strong> OLTP schemas are <em>normalized</em> — data split across many
tables to avoid redundancy and keep writes consistent and cheap. That’s ideal for
transactions but painful for analysis, where answering a business question means
joining a dozen normalized tables together every time. OLAP systems are
<em>denormalized</em> — data deliberately pre-joined into wide tables and
<a href="/essays/a-field-guide-to-dimensional-modeling/">dimensional models</a> like
<a href="/essays/star-schema-vs-snowflake-schema/">star schemas</a> — so analytical queries are
simple and fast. Again: opposite choices, each correct for its own job.</p>

<p><strong>Contention.</strong> This is the one that bites you in production. A heavy OLAP query
scanning millions of rows consumes huge amounts of memory, CPU, and I/O, and can hold
locks or saturate the database while it runs. On a dedicated analytical system,
fine — that’s what it’s for. On your <em>production OLTP database</em>, that same query
starves the fast little transactions your application depends on, and real users feel
it: checkouts hang, pages time out. You’ve made your app slow to compute a report.</p>

<h2 id="the-architectural-answer">The architectural answer</h2>

<p>Because one system can’t serve both workloads well, the standard architecture is to
<strong>keep them separate and move data from one to the other.</strong> Your application writes to
its OLTP database, optimised for transactions. On some cadence — batch jobs, or
<a href="/essays/how-to-make-a-data-pipeline-idempotent/">change data capture</a> streaming
changes continuously — that data is copied into a separate analytical store (a
warehouse or lakehouse) optimised for OLAP, where it’s reshaped into denormalized,
column-stored, analytics-friendly models.</p>

<p>This separation is why the modern data stack looks the way it does. The warehouse
isn’t a second copy of your database for no reason — it exists precisely <em>because</em> the
analytical workload needed its own home, with the opposite storage model, the
opposite schema design, and its own compute that can’t slow down your app. The
<a href="/essays/how-to-make-a-data-pipeline-idempotent/">pipelines that keep it in sync</a> are
the bridge between the two worlds.</p>

<h2 id="the-rule-of-thumb">The rule of thumb</h2>

<p>When you catch yourself about to run analytics against a production application
database, stop and recognise the workload mismatch. A little reporting on a small app
is survivable; real analytics at scale is not. The instinct to “just query the prod
DB” is the instinct that takes the site down at month-end close.</p>

<p>Keep transactions on OLTP. Keep analytics on OLAP. Move data deliberately between
them. It looks like more infrastructure than necessary right up until the moment a
single analyst’s query would have frozen your checkout flow — and then it looks like
exactly the right amount.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[OLTP systems handle many small transactions fast. OLAP systems scan huge volumes for analysis. They're optimized for opposite things — which is why querying your production database for analytics is a trap.]]></summary></entry><entry><title type="html">Data Contracts Are a Cultural Problem</title><link href="https://dataarchitect.studio/essays/data-contracts-are-a-cultural-problem/" rel="alternate" type="text/html" title="Data Contracts Are a Cultural Problem" /><published>2026-05-19T00:00:00+05:30</published><updated>2026-05-19T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/data-contracts-are-a-cultural-problem</id><content type="html" xml:base="https://dataarchitect.studio/essays/data-contracts-are-a-cultural-problem/"><![CDATA[<p>The phrase “data contract” makes the hard thing sound easy. It conjures a tidy
artefact — a YAML file, a JSON schema, a validation step in CI — and implies that
once you have the artefact, you have the contract. You do not. The artefact is the
easy ten percent. The other ninety percent is an agreement between human beings
about who owes what to whom, and no schema validator has ever enforced <em>that</em>.</p>

<h2 id="what-a-contract-actually-is">What a contract actually is</h2>

<p>A data contract is a promise from a <strong>producer</strong> to a <strong>consumer</strong>. The producer
says: this data will arrive, in this shape, with this meaning, at this cadence,
with these guarantees — and if any of that is going to change, you’ll hear about it
before it breaks you.</p>

<p>Read that promise again and notice how little of it is about types. “In this shape”
is the schema part, and it’s genuinely useful — catching a column rename before it
reaches production is a real win. But “with this meaning,” “at this cadence,”
“you’ll hear about it before it breaks you” — these are <em>commitments</em>, not
constraints. They describe a relationship. And relationships are not enforced by
files; they’re enforced by accountability.</p>

<blockquote>
  <p>A schema tells you the <code class="language-plaintext highlighter-rouge">status</code> column is a string. It cannot tell you that
<code class="language-plaintext highlighter-rouge">status = 'closed'</code> quietly started meaning two different things last March
because an upstream team shipped a feature and told no one.</p>
</blockquote>

<p>That second failure — the semantic drift, the silent change of meaning — is the one
that actually destroys trust in data. And it’s precisely the failure a schema check
sails straight past.</p>

<h2 id="why-contracts-fail">Why contracts fail</h2>

<p>When data contract initiatives fail, and many do, the autopsy almost always finds
the same three causes. None of them are technical.</p>

<p><strong>No one owns the producing system.</strong> The data is emitted by a service whose team
considers analytics someone else’s problem. They didn’t agree to a contract; a
contract was <em>declared at them</em> by a downstream team with no leverage. The first
time shipping velocity conflicts with the contract, the contract loses, because no
one on the producing side ever signed up to defend it.</p>

<p><strong>There’s no consequence for breaking it.</strong> A promise with no cost for breaking it
is not a promise; it’s a suggestion. If a producer can break the contract and the
only result is a Slack message from an annoyed analyst, the contract has no teeth.
Real contracts are backed by something — a failing build that blocks <em>their</em>
deploy, an SLA tied to a team’s objectives, a review gate they actually have to
pass.</p>

<p><strong>It was written by the wrong people.</strong> Contracts drafted entirely by the data team
encode what the data team <em>wishes</em> were true, not what the producer can actually
commit to. A contract the producer didn’t help write is a wish list, and wish
lists rot.</p>

<h2 id="the-cultural-prerequisites">The cultural prerequisites</h2>

<p>This is why I’ve come to think of data contracts as a cultural problem with a
technical surface. Before the YAML is worth writing, three things have to be true
in the organization:</p>

<ul>
  <li><strong>Producers accept that their output is a product.</strong> The moment a team emits data
that another team depends on, they are running a product whether they like it or
not. The contract just makes the existing dependency explicit. Cultures that
resist this resist the contract.</li>
  <li><strong>Ownership is unambiguous.</strong> Every important dataset has a name attached — a team
that is genuinely responsible for its shape and meaning, with the authority to
make and defend decisions about it. Shared ownership is no ownership.</li>
  <li><strong>Breaking changes have a cost the producer feels.</strong> Until the producer has skin
in the game, the contract is decoration.</li>
</ul>

<p>Get those right and the technical part becomes almost trivial — a schema in version
control, a check in CI, a changelog, a deprecation window. Get them wrong and the
most beautiful contract tooling in the world will sit unused while data quietly
breaks in production exactly as before.</p>
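
<p>The technical surface really can be this small. As a rough sketch, assuming the contract is a column-to-type mapping kept in version control (the schemas and function names below are hypothetical), a CI step might flag breaking changes like this:</p>

```python
# Hypothetical schemas for illustration: column name mapped to a type name.
CONTRACT = {"order_id": "int", "status": "string", "amount": "decimal"}


def breaking_changes(contract, proposed):
    """List changes that would break a consumer relying on the contract."""
    problems = []
    for column, typ in contract.items():
        if column not in proposed:
            problems.append(f"removed or renamed column: {column}")
        elif proposed[column] != typ:
            problems.append(
                f"type change on {column}: {typ} to {proposed[column]}"
            )
    # Columns that exist only in `proposed` are additive, so they pass.
    return problems


def ci_exit_code(contract, proposed):
    """Return a non-zero code so a CI runner would block the deploy."""
    problems = breaking_changes(contract, proposed)
    for problem in problems:
        print("BREAKING:", problem)
    return 1 if problems else 0


additive = dict(CONTRACT, coupon_code="string")
renamed = {"order_id": "int", "state": "string", "amount": "decimal"}

print(ci_exit_code(CONTRACT, additive))
print(ci_exit_code(CONTRACT, renamed))
```

<p>The additive change returns 0 and passes; the rename returns 1. Whether that exit code blocks the producer’s deploy or only yours is the organizational decision, and no amount of code makes it for you.</p>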

<h2 id="what-to-actually-do">What to actually do</h2>

<p>If you’re trying to introduce contracts, resist the urge to start with the tooling.
Start with the relationship. Find the one upstream–downstream pair where breakage
hurts the most, and get <em>both</em> teams in a room. Write down, in plain language, what
the producer can commit to and what the consumer truly needs. Make the producer a
co-author, not a target. Then — and only then — encode the agreement in a schema and
wire up a check that fails <em>their</em> pipeline, not just yours.</p>

<p>The file you produce at the end will look like every “data contract” example on the
internet. But it will work, because underneath it sits the thing those examples
never show: two teams who actually agreed.</p>

<p>The schema is the artifact. The agreement is the contract. Don’t mistake one for
the other.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[A schema check is the easy 10% of a data contract. The other 90% is an organizational agreement that no YAML file can enforce for you.]]></summary></entry><entry><title type="html">Star Schema vs Snowflake Schema: Which to Use and When</title><link href="https://dataarchitect.studio/essays/star-schema-vs-snowflake-schema/" rel="alternate" type="text/html" title="Star Schema vs Snowflake Schema: Which to Use and When" /><published>2026-05-12T00:00:00+05:30</published><updated>2026-05-12T00:00:00+05:30</updated><id>https://dataarchitect.studio/essays/star-schema-vs-snowflake-schema</id><content type="html" xml:base="https://dataarchitect.studio/essays/star-schema-vs-snowflake-schema/"><![CDATA[<p>The difference between a star schema and a snowflake schema is smaller than the
debate around it suggests. Both are dimensional models — facts in the middle,
dimensions around them. The entire distinction is one decision: <em>do you normalize
your dimension tables, or not?</em> Everything else follows from that single choice.
Let’s make it properly.</p>

<h2 id="the-one-real-difference">The one real difference</h2>

<p>In a <strong>star schema</strong>, each dimension is a single, flat, denormalized table. The
product dimension holds the product, its category, its brand, its supplier — all
in one wide table, even though category and brand repeat across many rows.</p>

<p>In a <strong>snowflake schema</strong>, you normalize those dimensions into a hierarchy. Product
points to a separate category table, which points to a department table; brand
lives in its own table; supplier in another. The single dimension “snowflakes” out
into a branching structure of smaller related tables — which is where the name
comes from.</p>

<p>That’s it. Star is denormalized dimensions; snowflake is normalized dimensions. If
you understand <a href="/essays/a-field-guide-to-dimensional-modeling/">why dimensional models split measurements from
context</a>, you already understand
both — snowflaking is just normalization applied to the context tables.</p>

<h2 id="what-snowflaking-buys-you">What snowflaking buys you</h2>

<p>Normalizing dimensions isn’t crazy; it has genuine, if narrow, advantages.</p>

<ul>
  <li><strong>Less storage and less redundancy.</strong> “Electronics” is stored once in a category
table instead of repeated on ten thousand product rows. On very large dimensions
this saves space.</li>
  <li><strong>Cleaner updates to shared attributes.</strong> Rename a category in one row rather than
in every product that shares it. Fewer places for an update to go wrong.</li>
  <li><strong>It mirrors how the source system already thinks.</strong> OLTP databases are normalized,
so a snowflake can feel like a more faithful translation of the upstream model.</li>
</ul>
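
<p>The update argument is easy to see in miniature. The sketch below uses Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> module with hypothetical table names and 1,000 made-up products. It illustrates the row counts involved, not how any particular warehouse executes updates.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star: category is repeated on every product row.
    CREATE TABLE dim_product_star (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_name TEXT
    );
    -- Snowflake: category lives once in its own table.
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product_snow (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    """
)

conn.execute("INSERT INTO dim_category VALUES (1, 'Electronics')")
for product_id in range(1, 1001):
    name = f"product-{product_id}"
    conn.execute(
        "INSERT INTO dim_product_star VALUES (?, ?, 'Electronics')",
        (product_id, name),
    )
    conn.execute(
        "INSERT INTO dim_product_snow VALUES (?, ?, 1)", (product_id, name)
    )

# Rename the category in each model and record how many rows were touched.
star_rows = conn.execute(
    "UPDATE dim_product_star SET category_name = 'Consumer Electronics' "
    "WHERE category_name = 'Electronics'"
).rowcount
snow_rows = conn.execute(
    "UPDATE dim_category SET category_name = 'Consumer Electronics' "
    "WHERE category_id = 1"
).rowcount

print(star_rows, snow_rows)
```

<p>The star rewrites all 1,000 product rows; the snowflake rewrites one category row.</p>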

<p>These were compelling reasons in 1998, when storage was expensive and warehouses
ran on row-based engines that struggled with wide tables. They are much weaker
reasons today.</p>

<h2 id="what-it-costs-you">What it costs you</h2>

<p>The costs of snowflaking land squarely on the two things analytics cares about
most: query simplicity and performance.</p>

<blockquote>
  <p>Every level of normalization is another join the analyst must write and the
engine must execute. A question that’s one join away in a star (“sales by
category”) becomes two joins across three tables in a snowflake.</p>
</blockquote>

<p><strong>Queries get more complex.</strong> Analysts now have to know the shape of the hierarchy
and join through it correctly. More joins mean more chances to get a query subtly
wrong — and more friction for every person who touches the data.</p>
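
<p>As an illustration, here is the “sales by category” question asked of both shapes, using Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> module. The table names and rows are hypothetical, and the snowflake here stops at the category level.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER);

    -- Star: one flat product dimension.
    CREATE TABLE dim_product_star (
        product_id INTEGER PRIMARY KEY,
        category_name TEXT
    );

    -- Snowflake: product points to category in a separate table.
    CREATE TABLE dim_product_snow (
        product_id INTEGER PRIMARY KEY,
        category_id INTEGER
    );
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        category_name TEXT
    );

    INSERT INTO fact_sales VALUES (1, 100), (2, 50), (3, 25);
    INSERT INTO dim_product_star VALUES
        (1, 'Electronics'), (2, 'Electronics'), (3, 'Toys');
    INSERT INTO dim_category VALUES (10, 'Electronics'), (20, 'Toys');
    INSERT INTO dim_product_snow VALUES (1, 10), (2, 10), (3, 20);
    """
)

# Star: the category is one join away from the fact table.
STAR_QUERY = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_id = f.product_id
    GROUP BY p.category_name
    ORDER BY p.category_name
"""

# Snowflake: the analyst must know to go through product to reach category.
SNOWFLAKE_QUERY = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(star_result)
print(snowflake_result)
```

<p>Both queries return the same totals. The only difference is how much of the hierarchy the person writing the query has to carry in their head.</p>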

<p><strong>Performance often degrades rather than improves.</strong> This surprises people. The intuition
is that smaller tables are faster, but modern columnar warehouses (BigQuery,
Snowflake the product, Redshift, Databricks) are built to scan wide denormalized
tables efficiently and to compress repeated values away to almost nothing. The
storage you save by snowflaking is marginal, while the extra joins you add are
real work at query time. The denormalized star is usually the <em>faster</em> design on
exactly the engines most teams run today.</p>
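
<p>A toy run-length encoder gives a feel for why repeated values cost so little in a column store. This is a simplified sketch, not how any specific warehouse implements compression, and it assumes the repeated values sit next to each other in the column. The column contents are made up.</p>

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical denormalized column: category repeated on every product row.
category_column = ["Electronics"] * 10_000 + ["Toys"] * 5_000

encoded = run_length_encode(category_column)
print(len(category_column), len(encoded))
print(encoded)
```

<p>Fifteen thousand stored strings collapse to two pairs, which is the intuition behind “almost nothing.” Real engines combine several encodings and the savings depend on how the data is ordered, so treat this as a picture of the idea rather than a benchmark.</p>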

<p><strong>Maintenance gets heavier.</strong> More tables, more relationships, more pipeline steps
to keep in sync. The “cleaner” model is often more brittle in practice.</p>

<h2 id="the-practical-verdict">The practical verdict</h2>

<p>For analytics on a columnar cloud warehouse — which is most analytics now — <strong>default
to the star schema.</strong> Denormalize your dimensions. The storage cost is negligible,
the query experience is dramatically simpler, and performance is typically better.
Optimizing for storage by normalizing is solving a 1998 problem with a 2026 bill.</p>

<p>Reach for snowflaking only in specific cases:</p>

<ul>
  <li>A dimension is <strong>genuinely enormous</strong> (tens of millions of rows) <em>and</em> a shared
attribute is large and highly repetitive, so the storage saving is material.</li>
  <li>You have a <strong>rapidly changing shared attribute</strong> where updating it in one
normalized place meaningfully reduces error or cost.</li>
  <li>A <strong>compliance or governance</strong> requirement forces a single authoritative table
for a particular entity.</li>
</ul>

<p>Even then, snowflake only the dimension that needs it. Mixing is fine — a mostly-star
model with one normalized dimension is a perfectly reasonable, pragmatic design. You
don’t owe the schema purity.</p>

<h2 id="the-thing-underneath-the-choice">The thing underneath the choice</h2>

<p>Notice that “star vs snowflake” is really a proxy for an older question:
<strong>normalize for write-efficiency, or denormalize for read-efficiency?</strong> A warehouse
is overwhelmingly read-heavy — written by a handful of pipelines, queried by
everyone. So it should optimize for reads, which means denormalizing, which means
the star. The snowflake optimizes for the case a warehouse rarely faces.</p>

<p>Pick the star by default. Snowflake a dimension only when you can name the specific
problem it solves. And don’t lose an afternoon to the debate — it was only ever one
decision wearing two names.</p>]]></content><author><name>dataarchitect.studio</name></author><summary type="html"><![CDATA[Star schema vs snowflake schema comes down to one decision — whether to normalize your dimensions. Here's the trade-off, and why the star usually wins in a modern warehouse.]]></summary></entry></feed>