dataarchitect.studio

Your AI Is Only as Good as Your Data Architecture

2026-05-31T09:30:00+05:30

There’s a comforting story in which generative AI makes data architecture less important — point a clever enough model at your data and it’ll just figure things out, schema be damned. The opposite is true. Every way an AI system touches your data, it does so more credulously and at greater scale than any human ever did, which means it is more exposed to bad structure, not less. GenAI doesn’t let you skip the architecture. It raises the price of getting it wrong.

The new consumers don’t sanity-check

For years, the consumers of your data were humans and dashboards. Humans have a saving grace: when a number looks wrong, they squint at it. An analyst who sees revenue triple overnight assumes a pipeline broke before they assume the company did. That skepticism has quietly compensated for a lot of shaky data architecture.

The new consumers have no such instinct. A retrieval-augmented chatbot, an autonomous agent, an LLM translating a question into SQL against your warehouse — each will take whatever your data hands it and present the result with total, fluent confidence.

A human analyst who gets a weird number investigates it. An LLM reports it — in a complete sentence, with no hint that anything is off. AI removes the last layer of human skepticism that was silently covering for bad data.

This is the shift that matters. When the consumer stops double-checking, the structure and correctness of the data have to carry the entire burden of trust. That burden is exactly what data architecture exists to bear.

RAG quality is retrieval quality is data quality

Take the most common pattern: retrieval-augmented generation, where a model answers using documents pulled from your own data rather than from its training. RAG is widely treated as a model problem — pick the right embedding model, tune the prompt. But the quality of a RAG system is dominated by the quality of retrieval, and retrieval quality is dominated by the state of the underlying data.

If your source documents are duplicated, contradictory, stale, or unlabelled, the retriever faithfully surfaces duplicated, contradictory, stale, unlabelled context — and the model dutifully reasons over garbage. Many “hallucinations” aren’t the model inventing things; they’re the model accurately summarising bad or conflicting retrieved data. The fix in those cases isn’t a better model. It’s deduplication, clear metadata, freshness guarantees, and source-of-truth discipline — the same deliberate shape the rest of your data needs. A RAG system sits directly on top of your data architecture and inherits every weakness in it.
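That pre-indexing discipline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular RAG framework's API: the `Doc` fields, the `prepare_corpus` name, and the 365-day freshness window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
import hashlib


@dataclass(frozen=True)
class Doc:
    source_id: str    # hypothetical: stable id of the source record
    text: str
    updated_on: date
    owner: str        # hypothetical metadata label; empty means unlabelled


def prepare_corpus(docs, today, max_age_days=365):
    """Keep one fresh, labelled version per source and drop duplicate text."""
    cutoff = today - timedelta(days=max_age_days)
    latest = {}
    for d in docs:
        if d.updated_on < cutoff or not d.owner:
            continue  # stale or unlabelled: never index it
        kept = latest.get(d.source_id)
        if kept is None or d.updated_on > kept.updated_on:
            latest[d.source_id] = d  # newest version wins for each source
    seen, out = set(), []
    for d in sorted(latest.values(), key=lambda d: d.source_id):
        # Normalise whitespace and case so trivially different copies collide.
        normalised = " ".join(d.text.split()).lower()
        digest = hashlib.sha256(normalised.encode()).hexdigest()
        if digest not in seen:  # same text under another id: keep the first
            seen.add(digest)
            out.append(d)
    return out
```

Only what survives `prepare_corpus` would be embedded and indexed. The retriever can then only surface documents that are current, owned, and unique.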

When an AI queries your warehouse

The pattern with the highest stakes is letting a model query your warehouse directly — natural-language-to-SQL, or an agent with database access. The moment you do this, every latent ambiguity in your data becomes a live wire.

Suppose three tables each define “active user” slightly differently and none is marked canonical. A human analyst eventually learns which one to use, through folklore and scar tissue. An LLM has no folklore. It will pick a table, write plausible SQL, and report a number — and on the next question it may pick a different table and report a different number, each time with the same confidence. You’ve automated the production of inconsistent answers.

This is precisely why a semantic layer goes from nice-to-have to load-bearing the instant AI enters the picture. If “active user” and “revenue” are defined in exactly one governed place that the AI is made to query through, the model can’t improvise its own definitions. Without that layer, an AI data interface is a confident random-number generator wearing a tie. The semantic layer is the guardrail, and AI is what makes the guardrail non-optional.
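Here is a rough sketch of what "made to query through" could look like, using Python and SQLite as stand-ins. The `METRICS` registry, the `answer` function, and the `sessions` table with its columns are hypothetical names invented for illustration. A real semantic layer would be a governed service rather than a dictionary.

```python
import sqlite3

# Hypothetical governed registry: each metric has exactly one definition.
METRICS = {
    "active_users": (
        "SELECT COUNT(DISTINCT user_id) FROM sessions "
        "WHERE is_internal = 0 "
        "AND session_date >= DATE(:as_of, '-30 days')"
    ),
}


def answer(conn, metric, as_of):
    """Run the governed SQL for a metric, or refuse rather than improvise."""
    sql = METRICS.get(metric)
    if sql is None:
        raise LookupError(f"no governed definition for metric {metric!r}")
    return conn.execute(sql, {"as_of": as_of}).fetchone()[0]
```

The design choice that matters is the refusal. A model that asks for an undefined metric gets an error it has to surface, not a table it is free to guess at.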

Governance stops being paperwork

The same goes for the unglamorous disciplines that ambitious teams love to defer. Who owns this dataset? What’s it allowed to mean? Is it fresh? Are the contracts between producers and consumers actually honoured? These questions used to fail slowly and quietly — a stale table, a mildly wrong dashboard, an annoyed analyst. Feed the same ungoverned data to an AI system and the failure is fast, fluent, and scaled to every user who asks. Governance was always the foundation; AI just turned the lights on and showed everyone the cracks.

AI is an amplifier

The throughline is simple. Generative AI is an amplifier. Point it at well-structured, well-governed, single-source-of-truth data and it amplifies that quality into fast, trustworthy answers at a scale humans couldn’t match. Point it at the typical accreted swamp — duplicated facts, fuzzy definitions, unowned tables — and it amplifies that just as faithfully, manufacturing confident nonsense far faster than any human could.

So the arrival of AI doesn’t change the data architect’s job. It changes the consequences of doing it badly. The work — choosing the shape, defining the meaning, assigning the ownership, guaranteeing the quality — is the same work it always was. It has simply stopped being something you can get away with neglecting. (I’ve written separately about what GenAI actually changes versus what it doesn’t, for the fuller accounting.)

The teams that win with AI won’t be the ones with the cleverest prompts. They’ll be the ones whose data was already in deliberate shape when the AI showed up — because an amplifier is only ever as good as the signal you feed it.

What GenAI Actually Changes About Data Architecture — and What It Doesn’t

2026-05-31T08:30:00+05:30

Every few years a technology arrives that vendors insist changes everything about data architecture, and the honest answer is always the same: it changes some things, leaves most things alone, and the trick is telling which is which before you rebuild your stack around a slide deck. Generative AI is the current everything-changer. So let’s do the accounting plainly — what it genuinely changes, and what it pointedly does not.

What it actually changes

Three things are real, and worth taking seriously.

A new storage primitive: the vector. To retrieve data by meaning rather than by exact match, you store numerical representations — embeddings — and search them by similarity. That’s a genuinely new access pattern, and it’s brought vector indexes into the stack, whether as dedicated vector databases or as vector capabilities bolted onto databases you already run. If your applications need semantic search or retrieval-augmented generation, this is a real addition to your architecture, not hype.

Unstructured data becomes first-class. For decades, the analytical stack was built around structured, tabular data, and the piles of text, documents, images, and transcripts mostly sat untapped. GenAI gives you a practical way to extract meaning from that unstructured mass and put it to work. That genuinely expands what data architecture is responsible for — the lake and lakehouse patterns that store raw, varied data suddenly have a lot more to do.

A new retrieval pattern and a new consumer. RAG — fetch relevant context, then let a model reason over it — is a legitimately new shape for how data flows to an answer. And the consumer at the end of it is new in kind: a system that reads your data and acts on it without the human skepticism that used to be the last line of defence.
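The access pattern behind the first and third items, search by similarity rather than exact match, is small enough to sketch. The function names and the three-dimensional vectors in the usage example are made up for illustration. Real embeddings come from a model and have far more dimensions, and production vector indexes typically use approximate nearest-neighbour structures rather than the brute-force scan shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """index maps doc id -> embedding. Returns the k closest ids, best first."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]
```

For example, with `index = {"refund-policy": [0.9, 0.1, 0.0], "shipping-times": [0.1, 0.9, 0.1], "returns-faq": [0.8, 0.2, 0.1]}`, the query vector `[1.0, 0.0, 0.0]` ranks `refund-policy` first and `returns-faq` second.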

That’s the real list. Notice what’s on it: a new index type, a new data category brought into scope, a new retrieval flow. Additions to the architecture. Now notice what isn’t on it.

What it doesn’t change

The fundamentals. All of them. And they do more than survive: they get more load-bearing, not less.

You still need deliberate structure. A vector index doesn’t absolve you of modeling; it sits alongside your structured data, which still has to be shaped on purpose. The dimensional models and governed tables don’t disappear because some of your data is now embeddings. If anything, AI consumers reading that structured data make its shape matter more.

You still need data quality. As I’ve argued at length, a RAG system is only as good as the data underneath it. Duplicated, stale, contradictory data produces duplicated, stale, contradictory answers — now delivered fluently and at scale. Embeddings don’t clean your data. They faithfully encode whatever mess you hand them.

You still need governance and ownership. Who owns this dataset, what is it allowed to mean, is it fresh — these questions don’t soften when an AI starts consuming the data. They sharpen, because the failure mode goes from a quietly wrong dashboard to a confidently wrong answer served to every user who asks.

You still need defined meaning. An LLM querying your warehouse with no single governed definition of “revenue” will invent its own, repeatedly and inconsistently. The semantic layer doesn’t become obsolete in the AI era; it becomes the guardrail that makes the AI era survivable.

GenAI adds a vector index to the stack. It does not subtract the need for structure, quality, governance, or meaning. The new thing is additive; the old things are load-bearing.

The pattern to be skeptical of

Here’s where the hype does its damage. A great deal of “AI-native data platform” messaging is, underneath, selling you a vector database and a retrieval pipeline while implying you can now skip the boring disciplines — that the model is smart enough to paper over messy, ungoverned, ambiguous data. This is the same move every hype cycle makes, and it fails the same way. As with the medallion architecture, the trouble starts when a useful addition gets sold as a replacement for the fundamentals it actually depends on. A vector index on top of a swamp is just a faster way to retrieve swamp.

The tell is always the same: any pitch that lets you defer data quality, ownership, or definition because “the AI handles it” is selling you a way to scale your existing problems, not solve them.

The fundamentals are a moat, not a relic

So treat GenAI the way you’d treat any genuine-but-overhyped advance. Adopt the real additions — vector storage where you need semantic retrieval, the means to bring unstructured data into scope, RAG where it fits. And refuse the implied permission to neglect everything else, because everything else is precisely what determines whether the AI produces signal or confident noise.

The reassuring part, if you’ve spent years on the unglamorous work, is this: the fundamentals you’ve been told are about to be made obsolete are in fact the thing that separates teams who get value from AI from teams who get a very fast nonsense machine. Deliberate structure, clean data, clear ownership, defined meaning — these aren’t a relic the new era renders quaint. They’re the moat. AI just raised their value. The everything-changer, it turns out, mostly changed the stakes of getting the basics right.

Slowly Changing Dimensions, Explained Without the Jargon

2026-05-31T00:00:00+05:30

“Slowly changing dimensions” is an intimidating name for a simple idea. A dimension describes context — a customer, a product, a store. Over time, that context changes: the customer moves city, the product gets recategorised, the store switches region. A slowly changing dimension, or SCD, is just a strategy for what happens to history when one of those attributes changes. That’s the entire concept. The types — 1, 2, 3 — are only different answers to one question: do you keep the old value, or overwrite it?

If you’ve read the field guide to dimensional modeling, you’ve met this idea in passing. Here’s the full version.

The question every dimension eventually asks

Say you have a customer dimension, and the customer C-1077 was in the “Small Business” segment when they placed an order last year. This year they’ve grown and been moved to “Enterprise.” Someone now runs a report: sales by segment, last year. Should C-1077’s old orders count under “Small Business” (what they were at the time) or “Enterprise” (what they are now)?

There is no universally correct answer — only a business decision. SCD types are the menu of ways to implement whichever answer your business needs.

An SCD type is a technical setting that encodes a business choice: when reality changes, does our history change with it, or stay as it was?

Type 1 — Overwrite (forget the past)

The simplest strategy: when an attribute changes, you just update the value in place. The old value is gone.

UPDATE dim_customer
   SET segment = 'Enterprise'
 WHERE customer_id = 'C-1077';

Now all of C-1077’s orders — past and future — report under “Enterprise,” because that’s the only value the dimension remembers. Last year’s “sales by segment” will retroactively change.

Use Type 1 when history doesn’t matter for that attribute — correcting a misspelled name, fixing a data-entry error, or tracking a value where only the current state is ever meaningful. It’s cheap and easy. The cost is amnesia: you can never again ask what things looked like before.

Type 2 — Add a new row (keep full history)

This is the important one, the strategy people usually mean when they say “SCD.” On change, you don’t overwrite — you expire the old row and insert a new one. The customer now exists as two rows: the historical version and the current version, each valid for a span of time.

dim_customer
+-------------+-------------+------------+------------+------------+------------+
| customer_key| customer_id | segment    | valid_from | valid_to   | is_current |
+-------------+-------------+------------+------------+------------+------------+
| 4401        | C-1077      | Small Bus. | 2024-02-01 | 2026-03-14 | false      |
| 8902        | C-1077      | Enterprise | 2026-03-15 | 9999-12-31 | true       |
+-------------+-------------+------------+------------+------------+------------+

Each fact (each order) points to the customer_key that was current when the order happened. Orders placed before the change, including all of last year’s, reference key 4401 (“Small Business”); orders placed from 2026-03-15 onward reference 8902 (“Enterprise”). History stays honest: “sales by segment, last year” returns what was actually true then, and it never changes.

Two things make Type 2 work, and both are worth noticing. First, the validity columns (valid_from, valid_to, is_current) are what let two versions of “the same” customer coexist and be queried by date. Second, and this is why I keep linking them, Type 2 depends on a surrogate key. The natural key C-1077 is identical on both rows; it’s the meaningless surrogate (4401 vs 8902) that distinguishes the versions and gives facts something stable to point at. Without a surrogate key, or some equivalent per-version identifier, a fact has no way to say which version of the customer it belongs to.

The cost of Type 2 is that the dimension grows over time and your pipeline gets more complex — it has to detect changes, expire old rows, and insert new ones correctly. For most analytics where history matters, that complexity is worth paying.
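The expire-and-insert step can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for a warehouse. The column layout mirrors the dim_customer table above; the `apply_segment_change` function name and the auto-assigned integer surrogate are assumptions for illustration.

```python
import sqlite3

OPEN_END = "9999-12-31"  # sentinel for "still current", as in the table above

SCHEMA = """
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, assigned by the database
    customer_id  TEXT NOT NULL,        -- natural key, repeated across versions
    segment      TEXT NOT NULL,
    valid_from   TEXT NOT NULL,
    valid_to     TEXT NOT NULL,
    is_current   INTEGER NOT NULL
);
"""


def apply_segment_change(conn, customer_id, new_segment, change_date):
    """Expire the current row and insert a new version. Returns the current key."""
    row = conn.execute(
        "SELECT customer_key, segment FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row is not None and row[1] == new_segment:
        return row[0]  # nothing changed, so a re-run is a no-op
    with conn:  # one transaction: expire and insert succeed or fail together
        if row is not None:
            conn.execute(
                "UPDATE dim_customer "
                "SET valid_to = DATE(?, '-1 day'), is_current = 0 "
                "WHERE customer_key = ?",
                (change_date, row[0]),
            )
        cur = conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, segment, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, ?, 1)",
            (customer_id, new_segment, change_date, OPEN_END),
        )
    return cur.lastrowid
```

Calling it for C-1077 with “Small Business” from 2024-02-01 and then “Enterprise” from 2026-03-15 leaves two rows with the same validity dates as the table above. The surrogate keys will differ, since SQLite assigns them sequentially.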

Type 3 — Add a column (keep limited history)

A middle option, used far less often. Instead of a new row, you keep both the old and new value side by side in columns:

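+-------------+-----------------+------------------+--------------------+
| customer_id | current_segment | previous_segment | segment_changed_on |
+-------------+-----------------+------------------+--------------------+
| C-1077      | Enterprise      | Small Business   | 2026-03-15         |
+-------------+-----------------+------------------+--------------------+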

This preserves exactly one step of history — you can see the current and the immediately prior value, but nothing further back. Type 3 fits the narrow case where you care about “before and after” a single known transition (a re-org, a re-branding) and don’t need full history. It’s a specialist tool; reach for Type 2 when you want history in general, and Type 3 only for this specific shape of question.

Which to use

The choice is per-attribute, not per-table — a single dimension can mix strategies. A customer’s segment might be Type 2 (you want historical accuracy for reporting), while a typo correction in their name is Type 1 (no one wants to preserve the misspelling). Decide attribute by attribute:

Type 1 — you only ever care about the current value, or you’re fixing an error.
Type 2 — you need to report on history as it actually was. This is the workhorse; when in doubt, this is usually the right default for attributes that carry analytical meaning.
Type 3 — you need exactly one prior value around a specific, known change.

The trap to avoid

The classic mistake isn’t choosing the wrong type — it’s not choosing at all. Teams default to silent Type 1 (overwriting) simply because that’s what a plain UPDATE does, and only discover the problem months later when someone asks for a historical breakdown and finds the past has been quietly rewritten. By then the old values are gone and unrecoverable.

So make the decision deliberately, attribute by attribute, before the data starts flowing. That act — deciding what history means before you need it — is exactly the kind of small, deliberate choice that separates a designed dimension from one that merely accumulated. Slowly changing dimensions aren’t arcane. They’re just the place where you decide, on purpose, what your data is allowed to forget.

Data Warehouse vs Data Lake vs Lakehouse: A Clear Comparison

2026-05-30T00:00:00+05:30

Three terms get used almost interchangeably and mean genuinely different things: the data warehouse, the data lake, and the lakehouse. The confusion is understandable, because all three are “places you put data for analytics.” But they make opposite bets about structure, cost, and trust, and choosing well means understanding the bet each one makes. Here’s the clear version.

Data warehouse: structure first

A data warehouse stores structured, modeled data, optimised for analytical queries. Before data lands in a warehouse, it’s cleaned, shaped, and fitted to a schema — an approach called schema-on-write, because the structure is enforced at the moment you write the data in.

This is the world of dimensional models, star schemas, and curated tables. The warehouse’s whole value proposition is trust and speed for analytics: because everything is modeled and typed up front, queries are fast, results are consistent, and a business analyst can point a BI tool at it and get reliable answers without thinking about plumbing.

The costs are real, though. Schema-on-write means upfront modeling work before data is usable, and rigidity afterward — adding a new data source or changing shape takes deliberate effort. Warehouses also traditionally couple storage and compute and charge accordingly, which gets expensive at scale and makes them a poor home for huge volumes of raw or semi-structured data (logs, images, free text) that don’t fit neatly into columns.

Data lake: flexibility first

A data lake makes the opposite bet. It stores raw data of any shape — structured, semi-structured, unstructured — as files in cheap object storage (S3, GCS, Azure Blob). There’s no schema required to write; you impose structure later, when you read, an approach called schema-on-read.

This buys two things warehouses struggle with. First, cost: object storage is cheap, so you can keep enormous volumes of data affordably. Second, flexibility: you can dump data in now and decide what to do with it later, which suits data science, machine learning, and any workload that wants raw, untransformed inputs. The lake is also the natural home for immutable raw history you can always reprocess from.

But flexibility has a failure mode, and it has a name: the data swamp. With no enforced schema, no guaranteed quality, and often no clear ownership, a lake can degrade into a vast pile of files nobody trusts or understands. There’s also no built-in transactional guarantee — no clean notion of “this table is in a consistent state right now” — which makes lakes hard to use for the reliable, concurrent analytics that warehouses do effortlessly. A lake gives you everything and promises nothing.
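To make the two bets concrete, here is a minimal Python sketch of schema-on-write versus schema-on-read. Everything in it is a toy assumption: the two-column `SCHEMA`, the in-memory lists standing in for a warehouse table and a lake directory, and the function names.

```python
import json

SCHEMA = {"order_id": int, "amount": float}  # hypothetical two-column schema


def conform(record):
    """Coerce a raw record to SCHEMA, raising if it cannot fit."""
    return {col: typ(record[col]) for col, typ in SCHEMA.items()}


def warehouse_write(table, record):
    table.append(conform(record))     # schema-on-write: reject bad data at the door


def lake_write(files, record):
    files.append(json.dumps(record))  # schema-on-read: store anything, as-is


def lake_read(files):
    """Apply the schema at read time, separating usable rows from the rest."""
    good, bad = [], []
    for line in files:
        try:
            good.append(conform(json.loads(line)))
        except (KeyError, ValueError, TypeError):
            bad.append(line)          # the problem surfaces only now
    return good, bad
```

The warehouse path fails loudly when a malformed record arrives. The lake path accepts it silently and leaves every future reader to discover and handle it, which is how a swamp accumulates.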

Lakehouse: an attempt to have both

The lakehouse is the newer idea, and it’s exactly what the name suggests: an attempt to get warehouse-like reliability on data-lake economics. You keep your data in cheap object storage as files — the lake part — but add a metadata and table layer on top that brings the structure and guarantees a warehouse has.

The lakehouse bet: keep the cheap, flexible storage of a lake, but add a layer that gives it the schema, transactions, and trust of a warehouse — so you don’t have to run and sync both.

That added layer is delivered by open table formats — Apache Iceberg, Delta Lake, Apache Hudi — which sit over the raw files and provide the things lakes lacked: ACID transactions (consistent reads and writes), schema enforcement and evolution, time-travel to past versions, and performance optimisations. The result is a single system where you can run both the flexible, ML-style workloads of a lake and the reliable, structured BI of a warehouse, without maintaining two separate platforms and a brittle pipeline copying between them.
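The core idea of that metadata layer can be shown with a deliberately tiny toy model. This is not how Iceberg, Delta Lake, or Hudi are implemented: real formats persist their metadata in object storage and handle concurrent writers, and the `ToyTable` class and file names here are invented. It only illustrates why "a list of immutable files per snapshot" yields consistent reads and time travel.

```python
class ToyTable:
    """Toy model of a table-format metadata layer over immutable data files."""

    def __init__(self):
        self.snapshots = [()]           # version 0: an empty table

    def commit(self, add=(), remove=()):
        """Publish a new snapshot; readers never see a half-applied change."""
        current = self.snapshots[-1]
        new = tuple(f for f in current if f not in remove) + tuple(add)
        self.snapshots.append(new)      # appending the snapshot is the commit
        return len(self.snapshots) - 1  # the new version number

    def files(self, version=None):
        """Files visible at a version (latest by default): time travel."""
        return self.snapshots[-1 if version is None else version]
```

A correction becomes a commit that removes one file and adds another. Readers pinned to the earlier version still see the old file, and new readers see only the fix.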

The trade-off is maturity and complexity. The lakehouse stack is younger and has more moving parts than a turnkey warehouse; you’re assembling storage, a table format, a query engine, and a catalog rather than buying one integrated product. For some teams that flexibility is the point; for others it’s overhead they don’t need.

How to choose

Strip away the marketing and it comes down to your workload:

Choose a warehouse when your work is overwhelmingly structured analytics and BI, you value simplicity and reliability over flexibility, and your data volumes are manageable. For a team that mostly builds dashboards and reports, a warehouse is still the simplest, most dependable answer — don’t over-engineer past it.
Choose a lake when you have large volumes of raw, varied, or unstructured data, heavy data-science or ML needs, and the engineering discipline to stop it becoming a swamp. Rarely the whole answer on its own anymore.
Choose a lakehouse when you genuinely need both — structured BI and raw/ML workloads on the same large datasets — and want to avoid running a separate lake and warehouse with a sync pipeline between them. This is where many teams are converging, but adopt it because you have both needs, not because it’s the newest word.

The part none of them solve

One caution worth ending on. Whichever you pick, it answers where data is stored and how it’s structured — a physical question. None of the three tells you what your data means: which definition of “revenue” is canonical, who owns the customer table, how “active user” is defined. That’s the job of a semantic layer, which sits above all three. Teams often expect a shiny new lakehouse to fix their consistency problems and are surprised when the same three-different-numbers arguments continue — because storage was never the thing causing them. Choose the right store for your workload. Then govern the meaning on top of it. They’re different problems, and you need answers to both.

How to Make a Data Pipeline Idempotent

2026-05-29T00:00:00+05:30

A data pipeline that can’t be safely re-run is a liability waiting for a bad night. Jobs fail halfway. Schedulers retry. Someone kicks off a backfill over a range that already partly ran. If any of those scenarios can corrupt your data — duplicate rows, double-counted revenue, inconsistent state — you don’t have a pipeline, you have a trap. The property that defuses all of it is idempotency, and building it in is more about a few disciplined patterns than about clever code.

What idempotency actually means

An operation is idempotent if running it many times produces the same result as running it once. Press a floor button in an elevator five times; the elevator still goes to the floor once. For a pipeline, it means: re-running the same job with the same input leaves the data in exactly the same correct state, no matter how many times it executes.

This is not the same as “exactly-once processing,” which is a hard distributed-systems guarantee about each input being handled precisely one time. Idempotency is more achievable and, for most batch and many streaming pipelines, more useful: you stop trying to guarantee the job runs once, and instead make it safe to run any number of times. Retries become boring. Backfills become safe. That’s the whole prize.

Don’t try to guarantee your job runs exactly once. Make it not matter how many times it runs.

The anti-pattern to eliminate first

The single most common idempotency killer is the blind append:

INSERT INTO fact_sales
SELECT * FROM staging_sales WHERE sale_date = '2026-05-29';

Run this twice and you have every row twice. The job has no notion of “I already did this date” — it just appends. Every pattern below is, at heart, a way to replace blind appends with operations that can absorb a re-run without duplicating.

Pattern 1: Overwrite by partition

The simplest and most robust pattern: make your unit of work a partition (most often a date), and have the job replace that partition rather than add to it.

DELETE FROM fact_sales WHERE sale_date = '2026-05-29';
INSERT INTO fact_sales
SELECT * FROM staging_sales WHERE sale_date = '2026-05-29';

Or, in a partitioned lake/warehouse, overwrite the single partition atomically (INSERT OVERWRITE, replaceWhere, dynamic partition overwrite — the syntax varies by engine). Now re-running the job for 2026-05-29 produces the same result every time: it clears the day and rebuilds it. Retries and backfills over any date range are automatically safe, because each date is rebuilt from scratch from its source.

This works beautifully when your data is naturally partitioned by an immutable processing window and you can fully recompute a partition from its inputs. Make the delete-and-load atomic so a failure mid-way can’t leave the partition empty.
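A runnable sketch of the atomic version, using Python's sqlite3 module as a stand-in for the warehouse. The table names come from the SQL above; the column list in `SCHEMA` and the `load_day` function name are assumptions for illustration.

```python
import sqlite3

# Hypothetical columns; both tables share one layout so SELECT * lines up.
SCHEMA = """
CREATE TABLE staging_sales (sale_id TEXT, sale_date TEXT, amount REAL);
CREATE TABLE fact_sales    (sale_id TEXT, sale_date TEXT, amount REAL);
"""


def load_day(conn, day):
    """Rebuild one day of fact_sales from staging_sales. Safe to re-run."""
    with conn:  # DELETE and INSERT commit together, or roll back together
        conn.execute("DELETE FROM fact_sales WHERE sale_date = ?", (day,))
        conn.execute(
            "INSERT INTO fact_sales "
            "SELECT * FROM staging_sales WHERE sale_date = ?",
            (day,),
        )
```

Using the connection as a context manager wraps both statements in a single transaction. If the insert fails, the delete is rolled back and the partition keeps its previous contents.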

Pattern 2: Merge (upsert) on a key

When you’re maintaining mutable state — a dimension, a “current status per entity” table — overwriting partitions doesn’t fit. Here the tool is a merge keyed on a stable identifier:

MERGE INTO dim_customer AS t
USING staging_customer AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

Because the merge keys on customer_id, running it twice with the same input is a no-op the second time — matched rows are updated to the same values, and nothing new is inserted. This is why a stable surrogate or business key matters so much: idempotent upserts are only possible if every row has a reliable identity to match on. No key, no safe merge.
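The same idempotent behaviour can be demonstrated end to end with Python's sqlite3 module. SQLite has no MERGE statement, so this sketch swaps in its INSERT ... ON CONFLICT upsert, which plays the same role. The `name` and `segment` columns and the `merge_customers` function are hypothetical, chosen only to make the example complete.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_customer (
    customer_id TEXT PRIMARY KEY,  -- the stable key the upsert matches on
    name TEXT,
    segment TEXT
);
CREATE TABLE staging_customer (customer_id TEXT, name TEXT, segment TEXT);
"""

# "WHERE true" is there because SQLite's parser needs a WHERE clause on the
# SELECT to tell the upsert's ON CONFLICT apart from a join constraint.
UPSERT = """
INSERT INTO dim_customer (customer_id, name, segment)
SELECT customer_id, name, segment FROM staging_customer WHERE true
ON CONFLICT(customer_id) DO UPDATE SET
    name = excluded.name,
    segment = excluded.segment
"""


def merge_customers(conn):
    """Apply staging to the dimension. Running it again changes nothing."""
    with conn:
        conn.execute(UPSERT)
```

This assumes staging holds at most one row per customer_id. If it can hold several, deduplicate it first, or the final state will depend on row order.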

Pattern 3: Deterministic output keys

For systems without merge, you can lean on the storage layer’s uniqueness. Compute a deterministic primary key for each output row from its inputs — say, a hash of the natural key plus the event timestamp — and write with insert-or-ignore / upsert semantics. Because the same input always produces the same key, a re-run collides with the rows it wrote last time instead of duplicating them. The key is the idempotency.
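As a minimal sketch of this pattern in Python with sqlite3: the key is a SHA-256 hash of the natural key plus the event timestamp, and the write uses SQLite's INSERT OR IGNORE. The `fact_event` table, its columns, and both function names are assumptions for illustration.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE fact_event (
    event_key   TEXT PRIMARY KEY,  -- deterministic, derived from the inputs
    natural_key TEXT,
    event_ts    TEXT,
    amount      REAL
)
"""


def row_key(natural_key, event_ts):
    """Same inputs always yield the same key."""
    return hashlib.sha256(f"{natural_key}|{event_ts}".encode()).hexdigest()


def write_events(conn, events):
    """events: iterable of (natural_key, event_ts, amount) tuples."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO fact_event "
            "(event_key, natural_key, event_ts, amount) VALUES (?, ?, ?, ?)",
            [(row_key(k, ts), k, ts, amt) for k, ts, amt in events],
        )
```

One design caveat: insert-or-ignore keeps the first version of a row. If a re-run may carry corrected values for the same key, an upsert that overwrites is the better fit.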

The rule for side effects

Transformations on data inside your warehouse are the easy case. The dangerous case is external side effects — sending an email, posting to an API, incrementing a counter, dropping a message on a queue. These are rarely idempotent by default: retry the step and you send the email twice.

The fixes are the same family of ideas applied outward: attach an idempotency key to each external action and have the receiving system de-duplicate on it (most serious payment and messaging APIs support exactly this); or record what you’ve already done in a state table and check it before acting. The principle holds — make the effect of running twice identical to running once.
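The second option, a state table checked before acting, might look like the sketch below. The `sent_actions` table, the `send_once` function, and the key format in the usage note are hypothetical. The `send` argument stands for whatever callable performs the real side effect.

```python
import sqlite3

SCHEMA = "CREATE TABLE sent_actions (idempotency_key TEXT PRIMARY KEY)"


def send_once(conn, idempotency_key, send):
    """Call send() only if this key has not already been recorded as done."""
    with conn:
        done = conn.execute(
            "SELECT 1 FROM sent_actions WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if done:
            return False  # an earlier run already performed this action
        send()            # if this raises, nothing is recorded and a retry is safe
        conn.execute(
            "INSERT INTO sent_actions (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
    return True
```

A key such as `"welcome-email:C-1077"` makes the second call a no-op. There is still a narrow window where `send()` succeeds and recording the key fails, which would cause a resend on retry. That is why de-duplication on the receiving side, where available, is the stronger guarantee.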

A checklist

Before you call a pipeline safe to re-run, confirm:

Re-running the same job for the same window produces identical output — verify it, don’t assume it.
There are no blind INSERTs; every write is an overwrite, a merge, or keyed.
Mutable tables are merged on a stable key.
Every external side effect carries an idempotency key or a “have I done this?” check.
A job that fails mid-run leaves no partial, half-correct state behind.

Get these right and the 3 a.m. retry stops being an incident. The scheduler can hammer the job, an engineer can rerun last month’s backfill without flinching, and the data lands in exactly one correct state every time — which is the entire point of building the thing carefully in the first place.

What Is a Semantic Layer, and Why Does Your Data Stack Need One?

2026-05-28T00:00:00+05:30

Here is a problem every data team eventually has. Three dashboards show “active users.” All three numbers are different. Each was built by a different person who made a slightly different, undocumented choice about what “active” means — a 30-day window here, an exclusion of internal accounts there. No one is wrong, exactly, and yet the organisation can no longer answer a simple question about itself. A semantic layer is the architectural answer to this problem.

A definition

A semantic layer is a single, governed place where your business concepts and metrics are defined once, in terms the business uses, independent of where the underlying data physically lives or which tool consumes it.

Instead of the definition of “active user” living implicitly inside a dozen dashboard queries, it lives in one explicit place — active_user = a user with at least one session in the trailing 30 days, excluding internal accounts — and every dashboard, notebook, and API asks the semantic layer for that metric rather than re-deriving it. Define once; consume everywhere; agree always.

A semantic layer turns a metric from something each report re-implements into something the organisation declares. The definition stops being folklore and becomes infrastructure.

What it sits between

Picture two halves of your stack. Below is the physical half: warehouse tables, the raw and refined layers, the actual columns and rows. Above is the consumption half: BI dashboards, ad-hoc SQL, notebooks, reverse-ETL, the AI tool someone just wired up to your warehouse.

The semantic layer sits in the middle and translates between them. It maps messy physical reality (“revenue is net_amount from fact_order_line, minus refunds joined from fact_returns, in the reporting currency”) to a clean business concept (“revenue”) that every consumer can request by name. Change the physical implementation underneath — refactor a table, fix a join — and consumers above never notice, because they only ever referred to “revenue,” not to the columns.
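The smallest possible stand-in for that translation is a database view, sketched here with Python's sqlite3 module. It is a simplification under stated assumptions: the column names and the `metric_revenue` view are invented, currency conversion is left out, and refunds are subtracted as a total rather than joined per order. A real semantic layer would also serve tools that never touch SQL.

```python
import sqlite3

# Hypothetical physical tables; the table names follow the example in the text.
PHYSICAL = """
CREATE TABLE fact_order_line (order_id TEXT, net_amount REAL);
CREATE TABLE fact_returns (order_id TEXT, refund_amount REAL);
"""

# One governed definition. Consumers ask for metric_revenue, never the tables.
REVENUE_VIEW = """
CREATE VIEW metric_revenue AS
SELECT (SELECT COALESCE(SUM(net_amount), 0) FROM fact_order_line)
     - (SELECT COALESCE(SUM(refund_amount), 0) FROM fact_returns) AS revenue;
"""


def revenue(conn):
    """What every consumer calls; it knows nothing about the physical tables."""
    return conn.execute("SELECT revenue FROM metric_revenue").fetchone()[0]
```

If the physical tables are later refactored, only the view definition changes. The `revenue` function, and every consumer that asks by name, stays the same.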

What it is not

Three confusions are worth clearing up, because each one leads teams to think they have a semantic layer when they don’t.

It is not just your BI tool. Tools like Looker, Power BI, and Tableau have had semantic modeling features for years, and that’s good — but if the definitions live only inside one BI tool, then your notebooks, your SQL users, and your other tools don’t share them. The “active user” defined in the BI tool and the one a data scientist computes in a notebook drift apart again. A real semantic layer is consumer-agnostic: every tool, BI or not, gets the same answer.

It is not the gold layer. This is the deeper confusion. As I argued in reconsidering the medallion architecture, bronze-silver-gold describes how data gets cleaner — physical refinement. It does not describe what data means. Teams keep trying to solve semantic consistency by building more gold tables, and it never quite works, because a pile of curated tables with no single governed definition of “revenue” is just a tidier way to produce three different revenue numbers. The semantic layer is the thing medallion structurally doesn’t give you.

It is not a data contract. Contracts govern the promises between a producer and a consumer of a dataset. A semantic layer governs the meaning of metrics consumed downstream. Related, complementary, not the same.

Why it’s having a moment

Two forces have pushed the semantic layer from “nice idea” to “increasingly necessary.” The first is the rise of headless BI and dedicated metrics layers — the idea that metric definitions deserve their own governed tier, queryable by any tool through an API, rather than being trapped inside one dashboard product.

The second is AI consumers. The moment you let a language model query your warehouse, the cost of ambiguous metrics goes up sharply. A human analyst who gets a weird number will sanity-check it; an LLM will confidently report whatever the schema hands it. If “revenue” isn’t defined in exactly one governed place, an AI tool will cheerfully compute its own version and present it as fact. A semantic layer is fast becoming the guardrail that makes querying-by-AI trustworthy at all.

The takeaway

If your team argues about whose number is right, you don’t have a tooling problem — you have a missing semantic layer. Somewhere between your tables and your dashboards, there should be one governed place that says what each metric means, so that every consumer asks the same question and gets the same answer. Build that, and the three-dashboards-three-numbers problem doesn’t get patched. It stops being possible.

The Medallion Architecture, Reconsidered

2026-05-27T00:00:00+05:30

The medallion architecture — bronze, silver, gold — has become the default mental model for organising a data lakehouse, and for good reason. It’s memorable, it maps to an intuition everyone shares, and it gives a sprawling pile of data a sense of direction. I’ve recommended it. I still do, sometimes. But defaults have a way of hardening into dogma, and this one has hardened more than most. It’s worth a second, honest look.

What the layers get right

Stripped to its essence, the pattern says: data should flow through stages of increasing refinement and trust. Bronze is raw, as-ingested, untouched — history you can always replay from. Silver is cleaned, conformed, deduplicated — the validated, queryable middle. Gold is business-ready — the curated, aggregated tables that feed dashboards and decisions.

Two things here are genuinely valuable and worth keeping no matter what you call the layers.

The first is provenance. Keeping raw data immutable in bronze means you can always answer “where did this number come from?” and, crucially, reprocess when you discover a bug in your logic. Throwing away raw data to save space is one of the few truly irreversible mistakes in this field. Bronze, as a discipline, guards against it.

The second is progressive refinement as an explicit idea — the recognition that raw data and decision-ready data are different things with different consumers, and that the transformation between them deserves to be a deliberate, staged, testable process rather than one heroic query.
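
As a rough illustration of "staged and testable", the sketch below splits a toy pipeline into two small functions, each of which can be tested on its own. The row shape, the field names and the validation rules are all assumptions I made up for the example.

```python
def to_silver(bronze_rows):
    """Clean, conform and deduplicate raw rows without mutating bronze."""
    seen = set()
    silver = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        amount = row.get("amount")
        if order_id is None or amount is None:
            continue  # drop rows that fail basic validation
        if order_id in seen:
            continue  # deduplicate on the business key
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "month": row["order_date"][:7],  # conform the date to YYYY-MM
            "amount": float(amount),         # conform the amount to a number
        })
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into a business-ready monthly total."""
    totals = {}
    for row in silver_rows:
        totals[row["month"]] = totals.get(row["month"], 0.0) + row["amount"]
    return totals


# Bronze is a tuple here to stress that it stays immutable and replayable.
bronze = (
    {"order_id": 1, "order_date": "2026-04-03", "amount": "100.0"},
    {"order_id": 1, "order_date": "2026-04-03", "amount": "100.0"},  # duplicate
    {"order_id": 2, "order_date": "2026-04-20", "amount": None},     # invalid
    {"order_id": 3, "order_date": "2026-05-01", "amount": "40.5"},
)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because bronze is never modified, fixing a bug in `to_silver` only means rerunning both stages from the same raw input.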

Those two ideas are good architecture. Hold onto them.

Where it quietly falls apart

The trouble starts when the three layers stop being a guideline and start being a rule — when every table must belong to a tier and every problem must be solved by adding another one. Several failure modes recur.

Gold sprawls. “Business-ready” has no natural boundary, so gold becomes a dumping ground of one-off aggregates, each built for a single dashboard, each subtly re-deriving metrics the others already computed. You end up with the exact inconsistency the architecture was supposed to prevent, now wearing a gold badge. Three tables claim to hold “monthly revenue” and all three disagree.

Silver becomes a swamp. Without a clear contract for what “cleaned and conformed” means, silver fills with tables that are sort of clean — half-modeled, inconsistently grained, owned by no one. It’s no longer raw and not yet trustworthy, which is the worst of both worlds: a layer everyone reads from and no one believes.

The layers masquerade as a semantic layer. This is the deepest problem. Medallion describes physical refinement: how raw bytes become clean tables. It says nothing about meaning: what a “customer” is, how “active” is defined, which metric is canonical. Teams keep trying to solve semantic problems by adding more gold tables, and it never works, because the layers were never about semantics. A bronze→silver→gold pipeline with no governed definitions on top is just a very tidy way to produce inconsistent numbers.

Medallion answers “how clean is this data?” It does not answer “what does this data mean, and who decides?” — and most data disasters are failures of the second question, not the first.

A more honest framing

I’ve stopped presenting medallion as the architecture and started presenting it as what it actually is: a useful convention for refinement, which must be paired with two things it doesn’t supply.

The first is ownership per dataset, not per layer. The question that matters isn’t “which tier is this in?” but “who owns this table’s shape and meaning?” A gold table no one owns is more dangerous than a bronze one, because people trust it. Tiers are a property of data; ownership is a property of teams. Don’t confuse them.

The second is a semantic layer that sits across the tiers — a single governed place where business concepts and metrics are defined once, regardless of which physical table they’re computed from. This is the part medallion structurally cannot give you, and it’s the part that actually determines whether your organisation agrees on what its numbers mean.

So: keep bronze’s immutable provenance. Keep the discipline of staged refinement. But hold the three-layer model loosely. It’s a sketch of how data gets cleaner, not a theory of how data gets meaning. Treat it as the former and it serves you well. Mistake it for the latter — as a surprising number of teams do — and you’ll spend a year building gold tables to solve a problem that was never about gold at all.

OLTP vs OLAP: Why You Shouldn’t Run Analytics on Your App Database

2026-05-23T00:00:00+05:30

Every data architecture eventually runs into the difference between OLTP and OLAP, usually the hard way: someone runs a big analytical query against the production application database, and the app slows to a crawl for everyone. The two acronyms describe two kinds of database workload that are optimised for opposite things, and understanding why is the foundation under nearly every decision about where analytics should live.

Two opposite jobs

OLTP — Online Transaction Processing — is the workload your application runs. It’s defined by many small, fast operations: a user places an order, updates their profile, adds an item to a cart. Thousands of these happen concurrently, each touching a handful of rows, mixing reads and writes. The database behind your app — Postgres, MySQL, SQL Server — is an OLTP system, and it’s tuned to do this well: insert a row, update a record, look up a single customer by ID, all in milliseconds.

OLAP — Online Analytical Processing — is the workload your analytics runs. It’s defined by few large, complex operations: “total revenue by product category by month for the last three years.” A single such query might scan millions or billions of rows, aggregate them, and join across large tables. There are far fewer of these queries, they’re almost entirely reads, and each one is enormous compared to an OLTP operation.

OLTP is many people each touching a little data. OLAP is a few people each touching a lot of data. Almost every difference between the two follows from that.

Why the same database can’t do both well

The two workloads don’t just differ in size — they pull the underlying design in incompatible directions. Three differences matter most.

Row storage vs column storage. OLTP databases store data by row, because a transaction typically wants a whole record at once (give me everything about this order). OLAP systems store data by column, because an analytical query typically wants one or two columns across a vast number of rows (give me the amount column for ten million orders, and sum it). Reading a single column from row-stored data means reading every full row just to extract one field, which is slow. Column storage reads only that column, which is fast. The storage layout that’s right for one workload is actively wrong for the other.
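
A toy model in Python shows the difference in how much data each layout has to touch. Counting fields is only a stand-in for disk I/O, and the records and helper names are invented for illustration. Real engines work with pages and compressed blocks, not Python dicts.

```python
# Row layout: every field of a record is stored together.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0, "status": "paid"},
    {"order_id": 2, "customer": "b", "amount": 20.0, "status": "paid"},
    {"order_id": 3, "customer": "c", "amount": 30.0, "status": "open"},
]

# Column layout: every value of a column is stored together.
columns = {key: [record[key] for record in rows] for key in rows[0]}


def sum_amount_row_store(rows):
    """Sum one field, counting every field we had to read along the way."""
    total = 0.0
    fields_touched = 0
    for record in rows:
        fields_touched += len(record)  # the whole record comes along
        total += record["amount"]
    return total, fields_touched


def sum_amount_column_store(columns):
    """Sum one field by reading only that column."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_fields = sum_amount_row_store(rows)
col_total, col_fields = sum_amount_column_store(columns)
print(row_total, row_fields)
print(col_total, col_fields)
```

Both layouts give the same total, but the row layout reads four fields per record to use one. That ratio gets worse as tables get wider.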

Normalized vs denormalized. OLTP schemas are normalized — data split across many tables to avoid redundancy and keep writes consistent and cheap. That’s ideal for transactions but painful for analysis, where answering a business question means joining a dozen normalized tables together every time. OLAP systems are denormalized — data deliberately pre-joined into wide tables and dimensional models like star schemas — so analytical queries are simple and fast. Again: opposite choices, each correct for its own job.

Contention. This is the one that bites you in production. A heavy OLAP query scanning millions of rows consumes huge amounts of memory, CPU, and I/O, and can hold locks or saturate the database while it runs. On a dedicated analytical system, fine — that’s what it’s for. On your production OLTP database, that same query starves the fast little transactions your application depends on, and real users feel it: checkouts hang, pages time out. You’ve made your app slow to compute a report.

The architectural answer

Because one system can’t serve both workloads well, the standard architecture is to keep them separate and move data from one to the other. Your application writes to its OLTP database, optimised for transactions. On some cadence — batch jobs, or change data capture streaming changes continuously — that data is copied into a separate analytical store (a warehouse or lakehouse) optimised for OLAP, where it’s reshaped into denormalized, column-stored, analytics-friendly models.
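
The batch flavour of that copy can be sketched with two in-memory SQLite databases standing in for the two systems. This assumes the source table carries an `updated_at` column to use as a watermark. The table names and the `sync_batch` helper are hypothetical. A watermark like this never sees deleted rows, which is one reason teams move to change data capture.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")  # stands in for the application database
olap = sqlite3.connect(":memory:")  # stands in for the analytical store

oltp.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    INSERT INTO orders VALUES
        (1, 10.0, '2026-05-01T10:00:00'),
        (2, 25.0, '2026-05-02T09:30:00');
""")
olap.execute(
    "CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)


def sync_batch(source, target, watermark):
    """Copy rows changed since the watermark and return the new watermark."""
    changed = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Upsert so that a changed row overwrites its earlier copy.
    target.executemany("INSERT OR REPLACE INTO orders_copy VALUES (?, ?, ?)", changed)
    target.commit()
    return changed[-1][2] if changed else watermark


# First run copies everything; an empty string sorts before any timestamp.
watermark = sync_batch(oltp, olap, "")

# The application changes an order; the next run picks up only that row.
oltp.execute(
    "UPDATE orders SET amount = 30.0, updated_at = '2026-05-03T08:00:00' WHERE id = 2"
)
watermark = sync_batch(oltp, olap, watermark)

copied = olap.execute("SELECT id, amount FROM orders_copy ORDER BY id").fetchall()
print(copied, watermark)
```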

This separation is why the modern data stack looks the way it does. The warehouse isn’t a second copy of your database for no reason — it exists precisely because the analytical workload needed its own home, with the opposite storage model, the opposite schema design, and its own compute that can’t slow down your app. The pipelines that keep it in sync are the bridge between the two worlds.

The rule of thumb

When you catch yourself about to run analytics against a production application database, stop and recognise the workload mismatch. A little reporting on a small app is survivable; real analytics at scale is not. The instinct to “just query the prod DB” is the instinct that takes the site down at month-end close.

Keep transactions on OLTP. Keep analytics on OLAP. Move data deliberately between them. It looks like more infrastructure than necessary right up until the moment a single analyst’s query would have frozen your checkout flow — and then it looks like exactly the right amount.

Data Contracts Are a Cultural Problem

2026-05-19T00:00:00+05:30

The phrase “data contract” makes the hard thing sound easy. It conjures a tidy artefact — a YAML file, a JSON schema, a validation step in CI — and implies that once you have the artefact, you have the contract. You do not. The artefact is the easy ten percent. The other ninety percent is an agreement between human beings about who owes what to whom, and no schema validator has ever enforced that.

What a contract actually is

A data contract is a promise from a producer to a consumer. The producer says: this data will arrive, in this shape, with this meaning, at this cadence, with these guarantees — and if any of that is going to change, you’ll hear about it before it breaks you.

Read that promise again and notice how little of it is about types. “In this shape” is the schema part, and it’s genuinely useful — catching a column rename before it reaches production is a real win. But “with this meaning,” “at this cadence,” “you’ll hear about it before it breaks you” — these are commitments, not constraints. They describe a relationship. And relationships are not enforced by files; they’re enforced by accountability.

A schema tells you the status column is a string. It cannot tell you that status = 'closed' quietly started meaning two different things last March because an upstream team shipped a feature and told no one.

That second failure — the semantic drift, the silent change of meaning — is the one that actually destroys trust in data. And it’s precisely the failure a schema check sails straight past.
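
Here is a small sketch of that blind spot. The contract, the rows and the `schema_violations` helper are all hypothetical. The particular drift, where 'closed' starts covering cancelled orders as well as fulfilled ones, is something I made up to give the example a shape.

```python
# Hypothetical contract for an orders feed: column names and types only.
EXPECTED_SCHEMA = {"order_id": int, "status": str}


def schema_violations(rows, schema):
    """Return a list of shape problems; an empty list means the batch passes."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            problems.append(f"row {i}: columns {sorted(row)} do not match the contract")
            continue
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                problems.append(f"row {i}: {column} is not {expected_type.__name__}")
    return problems


# A column rename is exactly the kind of change a schema check catches.
renamed = [{"order_id": 1, "state": "closed"}]

# Assumed drift: 'closed' used to mean fulfilled, and now also means cancelled.
before_march = [{"order_id": 1, "status": "closed"}]  # fulfilled
after_march = [{"order_id": 2, "status": "closed"}]   # actually cancelled

rename_problems = schema_violations(renamed, EXPECTED_SCHEMA)
drift_problems = schema_violations(before_march + after_march, EXPECTED_SCHEMA)
print(rename_problems)
print(drift_problems)
```

The rename is flagged and the drifted batch passes cleanly, because nothing about its shape changed. Only the producer knows the meaning moved.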

Why contracts fail

When data contract initiatives fail, and many do, the autopsy almost always finds the same three causes. None of them are technical.

No one owns the producing system. The data is emitted by a service whose team considers analytics someone else’s problem. They didn’t agree to a contract; a contract was declared at them by a downstream team with no leverage. The first time shipping velocity conflicts with the contract, the contract loses, because no one on the producing side ever signed up to defend it.

There’s no consequence for breaking it. A promise with no cost for breaking it is not a promise; it’s a suggestion. If a producer can break the contract and the only result is a Slack message from an annoyed analyst, the contract has no teeth. Real contracts are backed by something — a failing build that blocks their deploy, an SLA tied to a team’s objectives, a review gate they actually have to pass.

It was written by the wrong people. Contracts drafted entirely by the data team encode what the data team wishes were true, not what the producer can actually commit to. A contract the producer didn’t help write is a wish list, and wish lists rot.

The cultural prerequisites

This is why I’ve come to think of data contracts as a cultural problem with a technical surface. Before the YAML is worth writing, three things have to be true in the organisation:

Producers accept that their output is a product. The moment a team emits data that another team depends on, they are running a product whether they like it or not. The contract just makes the existing dependency explicit. Cultures that resist this resist the contract.
Ownership is unambiguous. Every important dataset has a name attached — a team that is genuinely responsible for its shape and meaning, with the authority to make and defend decisions about it. Shared ownership is no ownership.
Breaking changes have a cost the producer feels. Until the producer has skin in the game, the contract is decoration.

Get those right and the technical part becomes almost trivial — a schema in version control, a check in CI, a changelog, a deprecation window. Get them wrong and the most beautiful contract tooling in the world will sit unused while data quietly breaks in production exactly as before.
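
To show how small the technical part can be, here is a sketch of a check that compares two versions of a contract and lists the breaking changes. The schema format and the `breaking_changes` helper are assumptions of mine, and so is the policy it encodes: removing or retyping a column is breaking, adding one is not.

```python
def breaking_changes(old, new):
    """List changes that would break an existing consumer of the old schema."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    # Assumed policy: columns that only exist in `new` are additive, not breaking.
    return problems


# Two hypothetical versions of the same contract, as kept in version control.
v1 = {"order_id": "integer", "status": "string", "amount": "decimal"}
v2 = {"order_id": "integer", "state": "string", "amount": "float", "channel": "string"}

problems = breaking_changes(v1, v2)
for problem in problems:
    print(problem)
```

In a CI job, a non-empty `problems` list would turn into a non-zero exit code on the producer’s build, which is where the cost has to land for the contract to mean anything.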

What to actually do

If you’re trying to introduce contracts, resist the urge to start with the tooling. Start with the relationship. Find the one upstream–downstream pair where breakage hurts the most, and get both teams in a room. Write down, in plain language, what the producer can commit to and what the consumer truly needs. Make the producer a co-author, not a target. Then — and only then — encode the agreement in a schema and wire up a check that fails their pipeline, not just yours.

The file you produce at the end will look like every “data contract” example on the internet. But it will work, because underneath it sits the thing those examples never show: two teams who actually agreed.

The schema is the artefact. The agreement is the contract. Don’t mistake one for the other.

Star Schema vs Snowflake Schema: Which to Use and When

2026-05-12T00:00:00+05:30

The difference between a star schema and a snowflake schema is smaller than the debate around it suggests. Both are dimensional models — facts in the middle, dimensions around them. The entire distinction is one decision: do you normalize your dimension tables, or not? Everything else follows from that single choice. Let’s make it properly.

The one real difference

In a star schema, each dimension is a single, flat, denormalized table. The product dimension holds the product, its category, its brand, its supplier — all in one wide table, even though category and brand repeat across many rows.

In a snowflake schema, you normalize those dimensions into a hierarchy. Product points to a separate category table, which points to a department table; brand lives in its own table; supplier in another. The single dimension “snowflakes” out into a branching structure of smaller related tables — which is where the name comes from.

That’s it. Star is denormalized dimensions; snowflake is normalized dimensions. If you understand why dimensional models split measurements from context, you already understand both — snowflaking is just normalization applied to the context tables.

What snowflaking buys you

Normalizing dimensions isn’t crazy; it has genuine, if narrow, advantages.

Less storage and less redundancy. “Electronics” is stored once in a category table instead of repeated on ten thousand product rows. On very large dimensions this saves space.
Cleaner updates to shared attributes. Rename a category in one row rather than in every product that shares it. Fewer places for an update to go wrong.
It mirrors how the source system already thinks. OLTP databases are normalized, so a snowflake can feel like a more faithful translation of the upstream model.

These were compelling reasons in 1998, when storage was expensive and warehouses ran on row-based engines that struggled with wide tables. They are much weaker reasons today.

What it costs you

The costs of snowflaking land squarely on the two things analytics cares about most: query simplicity and performance.

Every level of normalization is another join the analyst must write and the engine must execute. A question that’s one join away in a star (“sales by category”) becomes a three-table traversal in a snowflake.
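
The same “sales by category” question can be run against both shapes in an in-memory SQLite database. The table and column names below are invented for the sketch, and SQLite is only a convenient stand-in. The point is the shape of the two queries, not the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (3, 25.0);

    -- Star: one flat product dimension, category repeated on every row.
    CREATE TABLE dim_product_star (product_id INTEGER, product_name TEXT, category_name TEXT);
    INSERT INTO dim_product_star VALUES
        (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');

    -- Snowflake: category normalized out into its own table.
    CREATE TABLE dim_category (category_id INTEGER, category_name TEXT);
    CREATE TABLE dim_product_snow (product_id INTEGER, product_name TEXT, category_id INTEGER);
    INSERT INTO dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
    INSERT INTO dim_product_snow VALUES (1, 'Laptop', 10), (2, 'Phone', 10), (3, 'Desk', 20);
""")

# Star: the category is one join away from the fact table.
STAR_SQL = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_id = f.product_id
    GROUP BY p.category_name
    ORDER BY p.category_name
"""

# Snowflake: the analyst has to know to go through product to reach category.
SNOWFLAKE_SQL = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

star_result = conn.execute(STAR_SQL).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_SQL).fetchall()
print(star_result)
print(snowflake_result)
```

Both queries return the same totals. The snowflake version just asks the analyst and the engine for one more join to get there.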

Queries get more complex. Analysts now have to know the shape of the hierarchy and join through it correctly. More joins mean more chances to get a query subtly wrong — and more friction for every person who touches the data.

Performance often degrades, not improves. This surprises people. The intuition is that smaller tables are faster, but modern columnar warehouses (BigQuery, Snowflake the product, Redshift, Databricks) are built to scan wide denormalized tables efficiently and to compress repeated values away to almost nothing. The storage you save by snowflaking is marginal, while the extra joins you add are real work at query time. The denormalized star is usually the faster design on exactly the engines most teams run today.
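
To see why repeated values cost so little in a column, here is a toy dictionary encoder in Python. I am not claiming this is the encoding any of those warehouses actually uses, and character counts are only a crude proxy for storage. The helper name and the sample column are made up.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a list of distinct values."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


# A denormalized category column: the same two strings repeated many times.
category_column = ["Electronics"] * 10_000 + ["Furniture"] * 5_000

dictionary, codes = dictionary_encode(category_column)

raw_chars = sum(len(value) for value in category_column)
encoded_chars = sum(len(value) for value in dictionary)
decoded = [dictionary[code] for code in codes]

print(dictionary)
print(raw_chars, encoded_chars, len(codes))
```

The strings are stored once each, and the column itself shrinks to a run of small integers, which compress further still. That is the redundancy snowflaking sets out to remove by hand.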

Maintenance gets heavier. More tables, more relationships, more pipeline steps to keep in sync. The “cleaner” model is often more brittle in practice.

The practical verdict

For analytics on a columnar cloud warehouse — which is most analytics now — default to the star schema. Denormalize your dimensions. The storage cost is negligible, the query experience is dramatically simpler, and performance is typically better. Optimizing for storage by normalizing is solving a 1998 problem with a 2026 bill.

Reach for snowflaking only in specific cases:

A dimension is genuinely enormous (tens of millions of rows) and a shared attribute is large and highly repetitive, so the storage saving is material.
You have a rapidly changing shared attribute where updating it in one normalized place meaningfully reduces error or cost.
A compliance or governance requirement forces a single authoritative table for a particular entity.

Even then, snowflake only the dimension that needs it. Mixing is fine — a mostly-star model with one normalized dimension is a perfectly reasonable, pragmatic design. You don’t owe the schema purity.

The thing underneath the choice

Notice that “star vs snowflake” is really a proxy for an older question: normalize for write-efficiency, or denormalize for read-efficiency? A warehouse is overwhelmingly read-heavy — written by a handful of pipelines, queried by everyone. So it should optimize for reads, which means denormalizing, which means the star. The snowflake optimizes for the case a warehouse rarely faces.

Pick the star by default. Snowflake a dimension only when you can name the specific problem it solves. And don’t lose an afternoon to the debate — it was only ever one decision wearing two names.