dataarchitect.studio

Data Lake vs Lakehouse: What Changed and Which to Use

The most common version of the storage question today isn’t warehouse-versus-anything — it’s data lake versus lakehouse. The short answer: a data lake stores raw files cheaply but gives you no guarantees about them; a lakehouse keeps those same files and adds a table layer on top that brings transactions, schema, and reliability. The lakehouse is, almost literally, a data lake plus one missing ingredient. Here’s what that ingredient is, why it changed everything, and when a plain lake is still enough.

Lake vs lakehouse, at a glance

| | Data lake | Lakehouse |
|---|---|---|
| Storage | Raw files, object storage | Raw files, object storage (same) |
| Transactions | None | ACID, via table formats |
| Schema | On read, unenforced | Enforced, with evolution |
| Concurrent writes | Unsafe | Safe |
| Time travel / history | No | Yes |
| Reliability for BI | Low | High |
| Main risk | Becoming a data swamp | Younger, more moving parts |

What a data lake is — and where it breaks

A data lake is a large store of raw files in cheap object storage. You land data of any shape — tables, JSON, logs, images — as files, and apply structure later, on read. This is genuinely useful: storage is cheap, so you keep enormous volumes affordably, and you don’t have to model anything up front, which suits data science and ML.

The problem is everything a bare pile of files doesn’t give you. There are no transactions, so two jobs writing at once can leave data half-updated and corrupt. There is no enforced schema, so a malformed file silently poisons downstream reads. There is no reliable history, so you can’t cleanly reproduce yesterday’s numbers. Absent quality checks and ownership, the lake drifts into a data swamp — a vast store nobody trusts. The very looseness that makes a lake cheap and flexible is what lets it rot.
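To make the silent-poisoning failure concrete, here is a minimal sketch using only the Python standard library. The file names, the `amount` field, and the `total_revenue` reader are all hypothetical, invented for illustration; the point is only that nothing in a bare directory of files rejects a batch whose shape has drifted.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, records):
    # Land records as one JSON object per line, the way a raw zone often does.
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n")

def total_revenue(lake_dir):
    """Schema-on-read: structure is applied only when the files are read."""
    total = 0.0
    for f in sorted(lake_dir.glob("*.jsonl")):
        for line in f.read_text().splitlines():
            rec = json.loads(line)
            # A lenient reader: a missing field quietly becomes 0.
            total += rec.get("amount", 0.0)
    return total

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    write_jsonl(lake / "orders_day1.jsonl",
                [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}])
    # An upstream job renamed the field; nothing in the lake rejects the file.
    write_jsonl(lake / "orders_day2.jsonl", [{"order_id": 3, "amt": 20.0}])
    lake_total = total_revenue(lake)

print(lake_total)  # 15.0, although the true revenue is 35.0
```

No error is raised anywhere. The dashboard simply reports a smaller number, which is what makes this kind of drift hard to notice.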

What a lakehouse adds: the table format

The lakehouse fixes this without giving up cheap storage. It keeps your data as files in object storage and adds a metadata layer on top — an open table format such as Apache Iceberg, Delta Lake, or Apache Hudi. That layer tracks which files make up a table, in what version, under what schema, and in doing so retrofits the guarantees a lake never had.
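As a rough mental model, the metadata layer can be sketched as a list of snapshots, each recording a version, a schema, and the set of files that make up the table at that version. This is a toy under stated assumptions, not how Iceberg, Delta Lake, or Hudi actually lay out their metadata; `Snapshot`, `TableMetadata`, and the file paths are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    version: int
    schema: tuple   # (column name, type name) pairs
    files: tuple    # data files that make up the table at this version

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)

    def commit(self, schema, files):
        # Each commit records a complete new description of the table.
        snap = Snapshot(len(self.snapshots), tuple(schema), tuple(files))
        self.snapshots.append(snap)
        return snap

    def current(self):
        return self.snapshots[-1]

schema = [("order_id", "long"), ("amount", "double")]
orders = TableMetadata()
orders.commit(schema, ["data/part-000.parquet"])
# A rewrite swaps the file list; the data files themselves are never edited.
orders.commit(schema, ["data/part-001.parquet", "data/part-002.parquet"])

print(orders.current().version)  # 1
print(orders.snapshots[0].files)  # ('data/part-000.parquet',)
```

The data files are only ever referenced by path. Everything that makes the files a table lives in the small metadata structure beside them.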

[Figure: data lake vs. lakehouse. Data lake: just files, with no transactions, no schema, no guarantees. Lakehouse: the same cheap files plus a metadata layer (the table format) that makes them tables, adding ACID, schema, and time travel.]
A lakehouse is the same lake storage with an open table format layered on top, turning a pile of files into reliable, versioned tables.

Concretely, the table format gives you ACID transactions (each write commits all-or-nothing, so concurrent jobs can’t corrupt a table), schema enforcement and evolution (writes that don’t match the table schema are rejected, and columns can change safely over time), and time travel (every commit creates a new version, so you can query the table as it was at any version still retained and reproduce old results exactly). None of that requires moving the data: it’s metadata over the files you already have. That’s the whole trick, and it’s why the lakehouse arrived as an evolution of the lake rather than a replacement for it.
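Two of those guarantees, schema enforcement and time travel, can be illustrated with a small in-memory toy. `ToyTable` and `SchemaError` are hypothetical names, and the sketch keeps rows in memory purely for brevity; real table formats version file lists, not row copies.

```python
class SchemaError(ValueError):
    pass

class ToyTable:
    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> Python type
        self.versions = [()]        # versions[n] = rows visible at version n

    def append(self, rows):
        # Validate the whole batch first, so one bad row rejects the entire write.
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaError(f"unexpected columns: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise SchemaError(f"{col} must be {typ.__name__}")
        # Only a fully valid batch becomes a new version.
        self.versions.append(self.versions[-1] + tuple(rows))
        return len(self.versions) - 1

    def read(self, version=None):
        # Time travel: read the latest version, or any earlier one by number.
        return list(self.versions[-1 if version is None else version])

orders = ToyTable({"order_id": int, "amount": float})
v1 = orders.append([{"order_id": 1, "amount": 10.0}])

rejected = False
try:
    # The second row uses a renamed field, so the whole batch is refused.
    orders.append([{"order_id": 2, "amount": 5.0}, {"order_id": 3, "amt": 20.0}])
except SchemaError:
    rejected = True

v2 = orders.append([{"order_id": 2, "amount": 5.0}])
print(rejected, len(orders.read()), len(orders.read(version=v1)))  # True 2 1
```

The rejected batch leaves no trace: the valid row that travelled with the bad one is not committed either, which is the all-or-nothing behaviour described above.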

A worked scenario

Picture a nightly job rewriting an orders table while a dashboard queries it. On a plain lake, the dashboard can read a half-written state — some new files present, some old ones already deleted — and return numbers that never actually existed. There’s no concept of a transaction to hide the in-progress write. On a lakehouse, the table format makes the rewrite a single atomic commit: the dashboard sees either the complete old version or the complete new one, never a torn mixture. Same files, same storage cost — but now the read is trustworthy. That gap is the entire reason lakehouses exist.
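The scenario can be simulated on a local filesystem with the standard library. This is a sketch under assumptions: the `_current` pointer file, the `part-*.json` naming, and the helper functions are hypothetical, and the atomic step here is a local `os.replace` rename, whereas real table formats on object storage rely on their own commit mechanisms. In this variant the torn state is old and new files sitting side by side.

```python
import json
import os
import tempfile
from pathlib import Path

def write_rows(path, rows):
    path.write_text(json.dumps(rows))

def total(paths):
    return sum(row["amount"] for p in paths for row in json.loads(p.read_text()))

def lake_total(table_dir):
    # Plain lake: the "table" is whatever data files are in the directory right now.
    return total(sorted(table_dir.glob("part-*.json")))

def commit(table_dir, file_names):
    # Write the new file list under a temporary name, then swap it in with one
    # atomic rename. Readers see the old list or the new one, never a mixture.
    tmp = table_dir / "_current.tmp"
    tmp.write_text(json.dumps(file_names))
    os.replace(tmp, table_dir / "_current")

def table_total(table_dir):
    # Table layer: only files named by the committed pointer belong to the table.
    names = json.loads((table_dir / "_current").read_text())
    return total(table_dir / name for name in names)

with tempfile.TemporaryDirectory() as tmp:
    orders = Path(tmp)
    write_rows(orders / "part-old-0.json", [{"order_id": 1, "amount": 10}])
    write_rows(orders / "part-old-1.json", [{"order_id": 2, "amount": 5}])
    commit(orders, ["part-old-0.json", "part-old-1.json"])

    # The nightly rewrite has written its new file but has not committed yet.
    write_rows(orders / "part-new-0.json",
               [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 7}])
    mid_lake = lake_total(orders)      # mixes old and new files
    mid_table = table_total(orders)    # still the complete old version

    commit(orders, ["part-new-0.json"])
    after_table = table_total(orders)  # the complete new version

print(mid_lake, mid_table, after_table)  # 32 15 17
```

The directory listing reports 32, a total that was never true at any point. The pointer-based reader reports 15 before the commit and 17 after it, with nothing in between.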

When a plain lake is still enough

You don’t always need the table layer. A bare lake is fine when you’re landing raw data that will be processed downstream anyway, doing exploratory data science where occasional inconsistency is acceptable, or archiving large volumes cheaply. In fact, most lakehouses keep a raw, untabled landing zone (the immutable bronze layer at the bottom of the medallion architecture) and promote data into managed tables only as it’s cleaned and trusted.
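The promotion step from a raw zone into a managed table might look roughly like the sketch below. The `promote` function, the field names, and the cleaning rules (cast types, drop replays by key, set aside anything unparseable) are illustrative assumptions, not a prescribed pipeline.

```python
def promote(raw_records):
    """Split raw landed records into clean rows and rejects."""
    clean, rejects, seen = [], [], set()
    for rec in raw_records:
        try:
            row = {"order_id": int(rec["order_id"]), "amount": float(rec["amount"])}
        except (KeyError, TypeError, ValueError):
            rejects.append(rec)        # stays in the raw zone for inspection
            continue
        if row["order_id"] in seen:    # drop repeated deliveries of the same key
            continue
        seen.add(row["order_id"])
        clean.append(row)
    return clean, rejects

# The raw zone keeps whatever arrived, exactly as it arrived.
raw_zone = [
    {"order_id": "1", "amount": "10.0"},
    {"order_id": "1", "amount": "10.0"},  # duplicate delivery
    {"order_id": "2", "amt": "5.0"},      # renamed field
    {"order_id": "x3", "amount": "7.5"},  # unparseable key
]
clean, rejects = promote(raw_zone)
print(clean)         # [{'order_id': 1, 'amount': 10.0}]
print(len(rejects))  # 2
```

The raw records are never modified. Only the cleaned rows would be written to a table that sits behind the table format, so a bad batch can be fixed and promoted again later.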

What you should not do is run reliable, concurrent BI and analytics directly on a bare lake and expect warehouse-grade trust. That’s precisely the workload the table format was invented for, and skipping it is how teams end up debugging numbers that quietly changed mid-query.

The bottom line

A data lake gives you cheap, flexible storage and no guarantees. A lakehouse keeps the cheap, flexible storage and adds the guarantees, through an open table format layered over the same files. For most teams doing serious analytics on large data, the lakehouse is now the default, because it removes the lake’s biggest liability — trust — at almost no extra storage cost. Keep a raw zone for landing and exploration; put a table format over anything you actually want to depend on. And remember that, like every storage choice, it decides where your data lives and how reliable it is — not what it means, which is still a semantic-layer problem sitting one level up.

Common questions

What is the difference between a data lake and a lakehouse?

A data lake is raw files in cheap object storage with no transactions, no enforced schema, and no guarantees. A lakehouse keeps those same files but adds an open table format on top — Iceberg, Delta Lake, or Hudi — that provides ACID transactions, schema enforcement, and time travel. The lakehouse is a data lake plus a table layer.

Is a lakehouse just a data lake with extra features?

Essentially, yes — and that's the point. The lakehouse doesn't replace the lake's cheap file storage; it layers metadata over it so the same data gains the reliability and structure that analytics needs, without copying everything into a separate warehouse.

What is a data swamp?

A data lake that has degraded into an unusable pile of files — no consistent schema, no quality, no clear ownership, so nobody trusts what's in it. The lack of structure and governance that makes a lake cheap and flexible is also what lets it rot.

Do I still need a data lake if I have a lakehouse?

No — the lakehouse is built on lake storage, so it is your lake, with a table layer added. You don't run both. You may still keep a raw, untabled zone for landing data before it's promoted into managed tables.