What Is an Open Table Format? Iceberg, Delta, and Hudi Explained
An open table format is a metadata layer that turns a pile of files in object storage into a real table — one with ACID transactions, an enforced schema, and a queryable history. The data stays as ordinary Parquet files in S3, ADLS, or GCS; the table format is a small tree of metadata files alongside them that records, for every version of the table, exactly which data files belong to it and under what schema. “Open” means the spec is published and engine-neutral: Spark, Trino, Flink, Snowflake, and BigQuery can all read and write the same table without a proprietary gatekeeper. The three names that matter are Apache Iceberg, Delta Lake, and Apache Hudi — and as of 2026, the industry has largely converged on Iceberg as the neutral standard.
This is the ingredient that turns a data lake into a lakehouse. Everything else in the lakehouse story depends on it.
What problem does a table format solve?
A bare data lake is just files. Nothing says which files form the “orders” table, whether a half-finished write should be visible, or what the table contained yesterday. Two jobs writing at once can corrupt each other; a schema change means hoping every reader notices. A table format addresses all of these at once, and adds row-level deletes and updates on top:
| | Bare files in a lake | With an open table format |
|---|---|---|
| What is the table? | A folder, by convention | An explicit manifest of files |
| Concurrent writes | Unsafe; last write wins | ACID commits, optimistic concurrency |
| Schema changes | Break readers silently | Versioned, safe evolution |
| History | Gone on overwrite | Snapshots + time travel |
| Deletes/updates | Rewrite everything | Row-level, via delete files or deletion vectors |
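To make the last row of the table concrete, here is a minimal Python sketch of a row-level delete that never rewrites the data file. It is a toy model, not any format's actual encoding: `DataFile`, `DeletionVector`, `delete_where`, and `scan` are hypothetical names, and real formats store the deleted positions as compressed bitmaps rather than a Python set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """Stand-in for an immutable Parquet file: a path plus its rows."""
    path: str
    rows: tuple


@dataclass
class DeletionVector:
    """Positions of deleted rows in one data file (toy version: a set)."""
    deleted_positions: set = field(default_factory=set)


def delete_where(data_file, vector, predicate):
    """Mark matching rows as deleted; the data file itself is untouched."""
    for pos, row in enumerate(data_file.rows):
        if predicate(row):
            vector.deleted_positions.add(pos)


def scan(data_file, vector):
    """Read the file, skipping positions the vector marks as deleted."""
    return [row for pos, row in enumerate(data_file.rows)
            if pos not in vector.deleted_positions]


orders = DataFile("orders-0001.parquet", (
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 25},
    {"order_id": 3, "amount": 40},
))
dv = DeletionVector()
delete_where(orders, dv, lambda r: r["order_id"] == 2)
live = scan(orders, dv)  # rows 1 and 3; the file still holds all three
```

The trade-off in this design is that writes stay cheap while every read pays a small filtering cost, which is why tables written this way are usually compacted periodically.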
How it works: a tree of metadata
Every table format is, at heart, the same trick: an atomic pointer to an immutable snapshot. A catalog holds one tiny pointer per table; the pointer names a metadata file; the metadata file names the snapshot; the snapshot lists the data files. Committing a change means writing new files and swapping one pointer — which is why a commit is atomic even on eventually-consistent object storage.
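The pointer swap can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real catalog is implemented: `Snapshot`, `Catalog`, `swap`, and `commit_append` are hypothetical names, the catalog lives in memory, and a lock stands in for whatever atomic compare-and-swap the real catalog provides.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """An immutable table version: an id, its parent, and its data files."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: frozenset


class Catalog:
    """Toy catalog: one pointer per table, changed only by compare-and-swap."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Point `table` at `new` only if it still points at `expected`."""
        with self._lock:
            if self._pointers.get(table) is not expected:
                return False
            self._pointers[table] = new
            return True


def commit_append(catalog, table, new_files, max_retries=5):
    """Optimistic commit: build a new snapshot, then try to swap the pointer."""
    for _ in range(max_retries):
        base = catalog.current(table)
        base_files = base.data_files if base else frozenset()
        next_id = base.snapshot_id + 1 if base else 1
        parent_id = base.snapshot_id if base else None
        candidate = Snapshot(next_id, parent_id, base_files | frozenset(new_files))
        if catalog.swap(table, base, candidate):
            return candidate
        # Another writer committed first: re-read the pointer and retry.
    raise RuntimeError("commit failed after retries")


catalog = Catalog()
v1 = commit_append(catalog, "orders", ["a.parquet"])
v2 = commit_append(catalog, "orders", ["b.parquet"])
# A writer still holding the stale v1 pointer loses the swap.
stale_swap = catalog.swap("orders", v1, Snapshot(99, 1, frozenset()))
```

Readers never see a half-finished write in this model because they only ever follow the pointer to a complete snapshot, and `v1` stays intact after `v2` is committed.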
Because every snapshot is preserved until expired, you get time travel for free:
```sql
CREATE TABLE lake.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount      DECIMAL(12,2)
) USING iceberg
PARTITIONED BY (days(order_ts));

-- yesterday's numbers, exactly as they were
SELECT sum(amount)
FROM lake.sales.orders
FOR TIMESTAMP AS OF current_timestamp - INTERVAL '1' DAY;

-- row-level change, committed atomically
MERGE INTO lake.sales.orders t
USING staged_orders s ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
That MERGE is the operation a bare lake never had, and it’s what makes patterns like change data capture and an honest medallion architecture practical on files.
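The time-travel query earlier in this section comes down to a lookup: find the latest snapshot committed at or before the requested timestamp, then read that snapshot's file list. A minimal Python sketch of that lookup follows, assuming a hypothetical in-memory `snapshot_log` with invented ids and commit times; a real engine reads this history from the table's metadata.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical snapshot log: (commit time, snapshot id), oldest first.
snapshot_log = [
    (datetime(2026, 1, 1, 8, 0), 101),
    (datetime(2026, 1, 2, 8, 0), 102),
    (datetime(2026, 1, 3, 8, 0), 103),
]


def snapshot_as_of(log, ts):
    """Return the id of the latest snapshot committed at or before `ts`."""
    commit_times = [t for t, _ in log]
    idx = bisect_right(commit_times, ts)
    if idx == 0:
        raise LookupError("no snapshot exists at or before that time")
    return log[idx - 1][1]


now = datetime(2026, 1, 3, 12, 0)
yesterday = snapshot_as_of(snapshot_log, now - timedelta(days=1))  # 102
latest = snapshot_as_of(snapshot_log, now)                         # 103
```

Once old snapshots are expired they drop out of the log, which is why time travel only reaches back as far as the retention policy allows.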
Iceberg vs Delta vs Hudi — where the war ended
For years this was a genuine three-way fight. It isn’t anymore. Iceberg became the neutral standard the moment every major platform adopted it: Databricks writes Iceberg through Unity Catalog, Snowflake ships first-class Iceberg tables and open-sourced its Polaris catalog, BigQuery reads it through BigLake, and Fabric bridges to it from Delta. Delta Lake remains a superb format and the native tongue of Databricks; UniForm lets a Delta table present itself as Iceberg. Hudi retains a niche for high-frequency upsert ingestion but has faded as a default choice. The 2026 spec work (Iceberg v3: deletion vectors, the variant type for semi-structured data, geo types) landed in both Snowflake and Databricks the same year — the formats are converging on the same capabilities.
Which is why the real architectural decision has moved up a layer: not which format, but which catalog — the component that holds those atomic pointers and governs who may swap them. That fight (Polaris, Unity, Glue, Horizon, Nessie) is still live, and it’s where lock-in now hides.
When you don’t need one
If your data lives in one warehouse, queried by one engine, an open table format adds moving parts and buys you little — the warehouse’s internal format already provides transactions and history. The format earns its complexity when the lakehouse premise applies: one cheap copy of data on object storage, many engines reading it, no per-engine ETL to keep copies in sync. That premise is why the modern data stack debate keeps circling back to open storage: whatever happens to the tools above, tables that no single vendor owns are the part everyone has agreed to keep.
Common questions
What is an open table format in simple terms?
It's a published specification for a metadata layer that sits on top of data files in object storage and makes them behave like a database table — tracking which files belong to the table, what schema they follow, and what the table looked like at any point in time. Apache Iceberg, Delta Lake, and Apache Hudi are the three main ones.
Is an open table format the same thing as a file format like Parquet?
No. Parquet defines how bytes are laid out inside one file. A table format defines how many files together form a table — which files are current, what schema applies, and how changes commit atomically. Iceberg, Delta, and Hudi all typically store their data in Parquet underneath.
Which open table format should I choose in 2026?
Apache Iceberg has become the de facto neutral standard: Databricks, Snowflake, BigQuery, and Fabric all support it, whether natively or through a bridge. Delta Lake remains excellent inside the Databricks ecosystem, and interop layers like UniForm blur the line. For a new multi-engine lakehouse, Iceberg is the safe default; the more consequential choice now is the catalog.
Do I need an open table format if I use a data warehouse?
If all your data lives happily inside one warehouse and one engine, no — the warehouse's internal format already does this job. Table formats earn their keep when you want cheap object storage as the single copy of data and multiple engines querying it without replication.