What Is Data Lineage, and What Is It Actually For?
Data lineage is the family tree of your data: a record of where each dataset comes from, what transformations it passes through, and everything downstream that depends on it. Pick any column in the warehouse and lineage answers two directions of question — upstream, where did this come from and what shaped it; downstream, what would break if I changed it. Simple idea, genuinely valuable, and currently sold with enough mystique that it’s worth stating plainly what it’s for and what it cannot do.
The two questions it answers
Downstream: blast radius. You want to drop a column, change a type, refactor a model. Lineage tells you every table, dashboard, and consumer that transitively depends on the thing you’re about to touch — turning “I hope nothing uses this” into a list. This is impact analysis, and it’s the single most practical use of lineage: changes stop being archaeology-plus-prayer.
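As a rough illustration of impact analysis, here is a minimal sketch that walks a dependency graph breadth-first and lists everything downstream of one node. The table and dashboard names, the `DOWNSTREAM` mapping, and the `blast_radius` helper are all hypothetical; a real lineage tool would supply the edges.

```python
from collections import deque

# Hypothetical edges: each key is read by the nodes in its list.
DOWNSTREAM = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["fct.revenue", "fct.refunds"],
    "fct.revenue": ["dash.finance"],
    "fct.refunds": [],
    "dash.finance": [],
}

def blast_radius(graph, start):
    """Return every node that transitively depends on `start`, in BFS order."""
    impacted = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted

print(blast_radius(DOWNSTREAM, "stg.orders"))
# ['fct.revenue', 'fct.refunds', 'dash.finance']
```

The visited set matters in practice: real graphs have diamonds (two paths to the same dashboard), and without it the same consumer would be listed twice.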
Upstream: root cause. A dashboard number looks wrong. Lineage gives you the path to walk backwards — which model fed it, which staging table fed that, which source it came from — so you can find where the wrongness entered instead of guessing across the whole platform. When a pipeline isn’t idempotent and a re-run doubled some rows three hops upstream, lineage is how you find the hop.
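The backwards walk can be sketched the same way, with the edges reversed. This is an illustrative example only: the `UPSTREAM` mapping and `upstream_hops` helper are made-up names, and grouping ancestors by hop distance is one possible way to order the investigation (check the nearest hop first, then move outward).

```python
# Hypothetical edges: each key reads from the nodes in its list.
UPSTREAM = {
    "dash.finance": ["fct.revenue"],
    "fct.revenue": ["stg.orders", "stg.fx_rates"],
    "stg.orders": ["raw.orders"],
    "stg.fx_rates": ["raw.fx_rates"],
}

def upstream_hops(graph, start):
    """Group the ancestors of `start` by distance, nearest hop first."""
    hops = []
    visited = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in graph.get(node, []):
                if parent not in visited:
                    visited.add(parent)
                    next_frontier.append(parent)
        if next_frontier:
            hops.append(next_frontier)
        frontier = next_frontier
    return hops

for distance, nodes in enumerate(upstream_hops(UPSTREAM, "dash.finance"), start=1):
    print(distance, nodes)
# 1 ['fct.revenue']
# 2 ['stg.orders', 'stg.fx_rates']
# 3 ['raw.orders', 'raw.fx_rates']
```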
Add the audit case — regulations that require showing where personal or financial data flows — and that’s the honest, complete list of what lineage is for.
Table-level vs column-level
A distinction that matters more than vendors let on. Table-level lineage says “model B reads from table A.” That is easy to capture, and too coarse for the questions above: knowing a dashboard reads from a 200-column table doesn’t tell you whether your column matters to it. Column-level lineage traces individual fields through every transformation, and it’s the resolution at which impact analysis actually works. It’s also much harder to extract: every SELECT *, every template, every dynamic query is a place column tracking goes dark. When evaluating tooling, press on exactly this: how much column-level lineage survives those cases.
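To make the difference in resolution concrete, here is a small sketch that answers the same question at both levels. Everything in it is assumed for illustration: the `COLUMN_EDGES` data, the table and column names, and both helper functions.

```python
# Hypothetical column-level edges: (table, column) -> consuming (table, column) pairs.
COLUMN_EDGES = {
    ("stg.customers", "email"): [("dim.customers", "email")],
    ("stg.customers", "signup_date"): [("dim.customers", "cohort_month")],
    ("dim.customers", "cohort_month"): [("dash.retention", "cohort")],
    ("dim.customers", "email"): [],
}

def impacted_columns(edges, start):
    """Follow column-level edges from `start` and return every column reached."""
    found, stack, seen = [], [start], {start}
    while stack:
        column = stack.pop()
        for consumer in edges.get(column, []):
            if consumer not in seen:
                seen.add(consumer)
                found.append(consumer)
                stack.append(consumer)
    return found

def impacted_tables(edges, table):
    """Table-level view: ignore column names and follow every edge out of `table`."""
    tables, stack = set(), [table]
    while stack:
        current = stack.pop()
        for (src_table, _), consumers in edges.items():
            if src_table != current:
                continue
            for dst_table, _ in consumers:
                if dst_table != table and dst_table not in tables:
                    tables.add(dst_table)
                    stack.append(dst_table)
    return tables

# Question: what is affected if stg.customers.email changes?
print(sorted(impacted_tables(COLUMN_EDGES, "stg.customers")))
# ['dash.retention', 'dim.customers']
print(impacted_columns(COLUMN_EDGES, ("stg.customers", "email")))
# [('dim.customers', 'email')]
```

At table level the retention dashboard looks at risk; at column level it is not, because it only consumes `cohort_month`. A SELECT * in any of these models would force a tool to fall back to the coarser answer unless it can resolve the schema.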
As for where lineage comes from: it’s either parsed from your transformation code and SQL, observed from query logs at runtime, or documented by hand. The first two can stay current automatically. The third is stale by next sprint — a lineage diagram maintained manually is a drawing of how the pipeline used to work.
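For a feel of what “parsed from your SQL” means, here is a deliberately naive sketch that pulls table-level dependencies out of query text with a regular expression. It is a toy under stated assumptions, not how production tools work: they use a full SQL parser. The `MODELS` dictionary, its queries, and the `table_dependencies` helper are hypothetical.

```python
import re

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def table_dependencies(sql):
    """Return the distinct tables a query reads from, in order of appearance."""
    deps = []
    for name in TABLE_REF.findall(sql):
        if name not in deps:
            deps.append(name)
    return deps

# Hypothetical model definitions, keyed by the table each one builds.
MODELS = {
    "stg.orders": "select id, amount, currency from raw.orders",
    "fct.revenue": """
        select o.id, o.amount * r.rate as amount_usd
        from stg.orders o
        join stg.fx_rates r on r.currency = o.currency
    """,
}

lineage = {model: table_dependencies(sql) for model, sql in MODELS.items()}
print(lineage)
# {'stg.orders': ['raw.orders'], 'fct.revenue': ['stg.orders', 'stg.fx_rates']}
```

The pattern breaks on exactly the cases that make extraction hard: it would treat a CTE name as a table, misread `extract(year from order_date)`, and see nothing inside templated or dynamically built queries. It also says nothing about columns.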
What lineage cannot do
Here’s the part the demos skip. Lineage is descriptive, not corrective. It maps the structure you have; it does not improve it.
A lineage graph of a swamp is a very accurate map of a swamp. Lineage tells you the blast radius of a change — it does nothing to make the radius smaller.
If everything depends on everything, lineage will render that tangle beautifully, and every impact analysis will return “all of it.” The fixes for that are architectural — deliberate layers, contracts at the boundaries, defined meaning — not observational.
Lineage also tells you what depends on a dataset, but not who answers for it. A graph without owners attached is trivia: you can see that thirty things break, but there’s no one to call. This is the same hole underneath most data-quality programs — the org-chart problem — and lineage doesn’t fill it; it just makes the unowned middle of your platform visible. Useful, but only if someone then assigns the names.
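One way to picture “attaching owners” is a join between an impact list and an ownership registry, with the gaps reported explicitly. This is a sketch under assumptions: the `OWNERS` mapping, the team names, and the `owners_to_notify` helper are invented for illustration.

```python
# Hypothetical ownership registry; assets missing from it have no named owner.
OWNERS = {
    "fct.revenue": "finance-data",
    "dash.finance": "finance-analytics",
}

def owners_to_notify(impacted, owners):
    """Split impacted assets into owner -> assets, plus a list of unowned assets."""
    by_owner, unowned = {}, []
    for asset in impacted:
        owner = owners.get(asset)
        if owner is None:
            unowned.append(asset)
        else:
            by_owner.setdefault(owner, []).append(asset)
    return by_owner, unowned

# Impact list as it might come out of a lineage query (made-up values).
impacted = ["fct.revenue", "fct.refunds", "dash.finance"]
by_owner, unowned = owners_to_notify(impacted, OWNERS)
print(by_owner)  # {'finance-data': ['fct.revenue'], 'finance-analytics': ['dash.finance']}
print(unowned)   # ['fct.refunds']
```

The second return value is the interesting one: it is the list of things that will break with no one to call.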
And one forward-looking note: as AI systems start consuming your data and answering questions from it, lineage becomes the provenance trail behind those answers — the only way to audit what an AI-served number was actually computed from. Another place where an old discipline quietly becomes load-bearing.
When to invest
You need real lineage when the platform has outgrown any one person’s head: enough models, layers, and consumers that “what breaks if I change this?” can’t be answered from memory, or a compliance regime that demands the trail. Below that threshold, a disciplined layered structure and tidy transformation code are your lineage — readable directly from the repo.
When you do invest: prefer lineage that’s extracted automatically from code and logs, insist on column-level where it counts, and treat the graph as an input to ownership and architecture decisions rather than an outcome in itself. Lineage is the map. The map is genuinely useful. But nobody ever fixed a city by mapping it — and the teams that get value from lineage are the ones who treat the map as the beginning of the work, not the deliverable.