You arrive from Bedrock knowing the shape of things: every artifact is
one row in the blob table, and each row has a
content column. It is natural to picture the file's bytes
sitting in that column, waiting. They are not. Open a row and you find
something stranger, and far cleverer — the reason a 24-year
project fits in a 68-megabyte file.
1. What the content column really holds.
In Bedrock, the blob schema annotated one column as
“Compressed content of this record.” Two separate truths
hide inside that short phrase, and the Vault is the district that
unpacks both.
The first: the bytes are zlib-compressed. A plain
SELECT content FROM blob hands you a compressed lump,
not readable text — something must inflate it first.
The second is the surprising one: the column often does not hold the artifact at all. It holds a delta — a description of how to build the artifact out of a different artifact. Fossil's own schema comment, the one we read in Bedrock, said exactly this and we walked straight past it: “it might hold the full text of the record or it might hold a delta.” Time to take it seriously.
2. Most artifacts are deltas.
Bedrock's schema blueprint showed a second table sitting right next
to blob: the delta table, just two columns
— rid and srcid. Its job is to record
a single fact: “blob rid is stored as a delta
against blob srcid.” If a blob has a row in
delta, its content is a delta; if it does
not, its content is the whole (compressed) artifact.
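The rule above fits in a few lines. The following is a minimal sketch using Python's standard sqlite3 module against a toy, in-memory copy of the two tables. The rows are invented for illustration, the real tables have more columns than shown, and `is_delta` is a hypothetical helper name, not a Fossil function.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Toy versions of the two tables; the real ones have more columns.
    CREATE TABLE blob(rid INTEGER PRIMARY KEY, size INTEGER, content BLOB);
    CREATE TABLE delta(rid INTEGER PRIMARY KEY, srcid INTEGER NOT NULL);
    -- Invented rows: four blobs, two of them stored as deltas (3 -> 2 -> 1).
    INSERT INTO blob(rid, size) VALUES (1, 900), (2, 910), (3, 905), (4, 120);
    INSERT INTO delta(rid, srcid) VALUES (2, 1), (3, 2);
""")

def is_delta(rid: int) -> bool:
    """True if blob `rid` has a row in delta, i.e. its content is a delta."""
    row = db.execute("SELECT srcid FROM delta WHERE rid = ?", (rid,)).fetchone()
    return row is not None

total = db.execute("SELECT count(*) FROM blob").fetchone()[0]
deltas = db.execute("SELECT count(*) FROM delta").fetchone()[0]
print(f"{deltas} of {total} toy blobs are stored as deltas")
```

The two `count(*)` queries should work unchanged in the sqlite3 shell against a real repository file, since they touch only the table names the text already uses.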
Back in the sqlite3 shell from Bedrock, the scale of it
is one query away.
Two numbers from that query, read together, are the whole point of the Vault:
Eight and a half gigabytes of logical content — the full text of every file at every revision across 24 years — held in 38 megabytes. Two mechanisms stack to do this: zlib squeezes each stored lump, and delta-encoding means most lumps are tiny descriptions of change rather than whole files. 88% of all artifacts are stored as deltas.
One more wrinkle worth holding onto: a delta's source can itself be a delta. Sources chain. Artifact 66 is a delta against 294; 294 might be a delta against something else again. That chain is the puzzle Street 4 has to solve.
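One way to see a chain is to let SQLite follow it. The sketch below uses a recursive common table expression over a toy delta table. The link from 66 to 294 comes from the text; the link from 294 to 12 is invented so the chain has somewhere to end, and `delta_chain` is a hypothetical helper name.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE delta(rid INTEGER PRIMARY KEY, srcid INTEGER NOT NULL);
    -- 66 -> 294 is from the text; 294 -> 12 is invented for illustration.
    INSERT INTO delta(rid, srcid) VALUES (66, 294), (294, 12);
""")

CHAIN_SQL = """
    WITH RECURSIVE chain(rid, depth) AS (
        SELECT ?, 0
        UNION ALL
        SELECT delta.srcid, chain.depth + 1
          FROM chain JOIN delta ON delta.rid = chain.rid
    )
    SELECT rid FROM chain ORDER BY depth
"""

def delta_chain(rid: int) -> list[int]:
    """Return [rid, its source, that source's source, ...] ending at a base."""
    return [row[0] for row in db.execute(CHAIN_SQL, (rid,))]

print(delta_chain(66))
```

This query has no guard against a chain that loops back on itself; against untrusted data, a `LIMIT` on the final `SELECT` would be a cheap safety net.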
3. A delta is a little program.
“Delta” can sound vague — some fuzzy notion of a
difference. It is not vague at all. A Fossil delta is a precise,
tiny program: a sequence of instructions for building the
target artifact out of a source artifact. Per Fossil's own
delta_format.wiki, it has three parts — a
header (the size of the target), a
segment-list (the instructions), and a
trailer (a checksum to verify the result).
The segment-list uses just two kinds of instruction:
- Copy — “copy N bytes starting at offset in the source.” This is where the saving comes from: a long shared run becomes a few-byte instruction.
- Insert — “insert these N brand-new literal bytes.” Only the genuinely new material is spelled out in full.
Here is a delta in miniature. The source is one short string; the delta is four instructions; the target is what running them produces.
In the figure, blue marks bytes copied from the source and amber marks newly inserted bytes.
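The same idea can be run by hand. The sketch below is a tiny interpreter for the two instruction kinds, written in Python. The tuple layout, the name `apply_delta`, and the sample strings are all made up for illustration; this is not Fossil's encoding, only the shape of it.

```python
def apply_delta(source: bytes, instructions: list[tuple]) -> bytes:
    """Build the target by running copy/insert instructions in order.

    ("copy", offset, length) -> take `length` bytes of `source` at `offset`
    ("insert", data)         -> append the literal bytes `data`
    """
    out = bytearray()
    for ins in instructions:
        if ins[0] == "copy":
            _, offset, length = ins
            if offset < 0 or offset + length > len(source):
                raise ValueError("copy reaches outside the source")
            out += source[offset:offset + length]
        elif ins[0] == "insert":
            out += ins[1]
        else:
            raise ValueError(f"unknown instruction {ins[0]!r}")
    return bytes(out)

# Made-up miniature: four instructions, two copies and two inserts.
source = b"the quick brown fox"
delta = [
    ("copy", 0, 10),          # b"the quick "
    ("insert", b"red"),
    ("copy", 15, 4),          # b" fox"
    ("insert", b" jumps"),
]
target = apply_delta(source, delta)
print(target)                 # b'the quick red fox jumps'
```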
Notice what the delta never contains: the words “the quick” and “fox.” They already exist in the source, so the delta just points at them. For a one-line edit to a large file, the delta is two copy instructions (the unchanged bulk, before and after) and one small insert (the new line) — a few dozen bytes to encode a change in a hundred-kilobyte file. Multiply across 58,394 deltas and you get the 222× from Street 2.
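The one-line-edit claim is easy to check on made-up data. The sketch below builds a file of a little over a hundred kilobytes, replaces one line, and writes the resulting delta as two copies around one insert. It borrows the length@offset, and length:bytes shapes but, as a simplification, uses decimal numbers and leaves out the header and checksum; the function name is hypothetical.

```python
def one_line_edit_delta(old: bytes, line_no: int, new_line: bytes):
    """Return (delta_text, new) for replacing line `line_no` (0-based) of `old`.

    Simplified notation: decimal numbers, no header, no checksum.
    """
    lines = old.splitlines(keepends=True)
    start = sum(len(ln) for ln in lines[:line_no])  # where the old line begins
    end = start + len(lines[line_no])               # just past the old line
    delta = (
        f"{start}@0,".encode()                      # copy the unchanged bulk before
        + f"{len(new_line)}:".encode() + new_line   # insert the new line
        + f"{len(old) - end}@{end},".encode()       # copy the unchanged bulk after
    )
    new = old[:start] + new_line + old[end:]
    return delta, new

# Invented sample file: 4000 short lines.
old = b"".join(b"line %05d of a made-up file\n" % i for i in range(4000))
delta, new = one_line_edit_delta(old, 2000, b"this line was edited\n")
print(len(old), "byte file,", len(delta), "byte delta")
```

The delta stays a few dozen bytes however large the file grows, because the copy instructions only name a length and an offset.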
On disk the instructions are written compactly: a copy as
length@offset, (the trailing comma is part of the syntax) and an insert as length:bytes,
with the target size at the top and the checksum at the tail.
The shape, though, is exactly that of the four instructions above.
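To show how little machinery the compact form needs, here is a sketch of a reader for it in Python. Two simplifications are assumptions of this sketch rather than facts about Fossil: numbers are written in decimal for legibility (Fossil uses a denser base-64 digit set), and the trailing checksum is parsed but never verified. The function name and the sample delta are invented.

```python
def apply_text_delta(source: bytes, delta: bytes) -> bytes:
    """Run a delta laid out as: size, newline, commands, then 'checksum;'.

    Simplified: decimal numbers, checksum parsed but not verified.
    """
    pos = 0

    def read_int() -> int:
        nonlocal pos
        start = pos
        while pos < len(delta) and delta[pos:pos + 1].isdigit():
            pos += 1
        if pos == start:
            raise ValueError(f"expected a number at byte {start}")
        return int(delta[start:pos])

    target_size = read_int()                 # header: size of the target
    if delta[pos:pos + 1] != b"\n":
        raise ValueError("header must end with a newline")
    pos += 1

    out = bytearray()
    while True:
        n = read_int()
        op = delta[pos:pos + 1]
        pos += 1
        if op == b"@":                       # copy: length@offset,
            offset = read_int()
            if delta[pos:pos + 1] != b",":
                raise ValueError("copy must end with a comma")
            pos += 1
            if offset + n > len(source):
                raise ValueError("copy reaches outside the source")
            out += source[offset:offset + n]
        elif op == b":":                     # insert: length:bytes
            if pos + n > len(delta):
                raise ValueError("insert runs past the end of the delta")
            out += delta[pos:pos + n]
            pos += n
        elif op == b";":                     # trailer: checksum;
            break
        else:
            raise ValueError(f"unknown command {op!r}")

    if len(out) != target_size:
        raise ValueError("result does not match the size in the header")
    return bytes(out)

source = b"the quick brown fox"
delta = b"23\n10@0,3:red4@15,6: jumps0;"     # the 0 is a placeholder checksum
print(apply_text_delta(source, delta))       # b'the quick red fox jumps'
```

Note that an insert needs no terminator: its length says exactly where the literal bytes stop, so the next command can begin immediately.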
4. Reconstruction.
A delta is only useful if something can run it. When any
part of Fossil asks for the real content of an artifact, the
function that answers is content_get in
content.c. Conceptually it faces one question and
handles two cases.
Is this artifact a delta? If not, the job is easy — fetch the row, unzip it, done. The small helper for that step is worth meeting first:
Fetching and uncompressing one blob · src/content.c · lines 218–230
content is zlib-compressed, so it is inflated before anyone sees it. Note the SQL filter size>=0: it quietly skips phantoms (Bedrock's size = -1 rows).
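The easy path can be imitated in Python with the standard sqlite3 and zlib modules. This is a sketch against a toy in-memory table, not a transcription of the C helper. One layout detail is an assumption drawn from our reading of Fossil's blob.c: the stored value is taken to be a 4-byte big-endian length followed by an ordinary zlib stream. The names `store` and `fetch_raw` are hypothetical.

```python
import sqlite3
import zlib

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blob(rid INTEGER PRIMARY KEY, size INTEGER, content BLOB)")

def store(rid: int, raw: bytes) -> None:
    """Toy writer. Assumed layout: 4-byte big-endian size, then zlib data."""
    packed = len(raw).to_bytes(4, "big") + zlib.compress(raw)
    db.execute("INSERT INTO blob VALUES (?, ?, ?)", (rid, len(raw), packed))

store(1, b"hello, vault\n")
db.execute("INSERT INTO blob VALUES (2, -1, NULL)")   # a phantom: size = -1

def fetch_raw(rid: int):
    """Return the inflated content of blob `rid`, or None if absent or phantom."""
    row = db.execute(
        "SELECT content FROM blob WHERE rid = ? AND size >= 0", (rid,)
    ).fetchone()
    if row is None:
        return None
    return zlib.decompress(row[0][4:])   # skip the assumed 4-byte size prefix

print(fetch_raw(1))   # b'hello, vault\n'
print(fetch_raw(2))   # None
```

What comes back is only the inflated column value. For a blob that has a row in the delta table, those bytes are still a delta and need the chain walk described next.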
But if the artifact is a delta? Then its source
might be a delta too, and its source as well. So
content_get first walks up the chain —
delta → source → source — until it reaches a real,
whole base artifact. It reconstructs that base, then applies the
deltas back down the chain, each one rebuilding the next,
until the artifact you actually asked for emerges.
Walking the delta chain · src/content.c · lines 262–310, inside content_get()
- delta_source_rid answers the one question: it returns the source rid, or 0 if this artifact is a whole base. Zero means the easy path above. Otherwise, the chain walk begins.
- The walk fills a list a[] of rids: this artifact, then its source, then its source's source, and so on. The list is grown by hand as it fills.
- fossil_panic is a guard: if the chain somehow loops on itself, fail loudly rather than spin forever.
- The last entry, a[n], is the base. content_get calls itself to reconstruct it; a function calling itself is called recursion.
- blob_delta_apply runs each delta-program against what we have so far, producing the next artifact. Every 8th step is cached so future lookups start closer.
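The walk up and the replay down can be sketched in a few dozen lines of Python. This is an illustration of the idea under simplifying assumptions, not a port of the C: the store is two invented dictionaries rather than a database, deltas are lists of tuples rather than Fossil's encoding, the base is reached with a plain loop instead of recursion, and there is no decompression or caching.

```python
def apply_delta(source: bytes, instructions) -> bytes:
    """Run ("copy", offset, length) / ("insert", data) steps against `source`."""
    out = bytearray()
    for ins in instructions:
        if ins[0] == "copy":
            out += source[ins[1]:ins[1] + ins[2]]
        else:
            out += ins[1]
    return bytes(out)

# Toy store, invented for illustration.
WHOLE = {1: b"alpha\nbeta\ngamma\n"}          # rid -> whole artifact
DELTAS = {                                    # rid -> (srcid, instructions)
    2: (1, [("copy", 0, 6), ("insert", b"BETA\n"), ("copy", 11, 6)]),
    3: (2, [("copy", 0, 11), ("insert", b"delta\n")]),
}

def content_get(rid: int) -> bytes:
    """Reconstruct artifact `rid`: walk up to a base, then replay deltas down."""
    chain = []                                # delta programs, nearest first
    seen = set()
    while rid in DELTAS:                      # walk up: delta -> source -> source
        if rid in seen:
            raise RuntimeError(f"delta loop at rid {rid}")   # fail loudly
        seen.add(rid)
        srcid, instructions = DELTAS[rid]
        chain.append(instructions)
        rid = srcid
    content = WHOLE[rid]                      # the whole base artifact
    for instructions in reversed(chain):      # walk back down, one delta at a time
        content = apply_delta(content, instructions)
    return content

print(content_get(3))   # b'alpha\nBETA\ndelta\n'
```

Reading artifact 3 here costs two delta applications. A cache of intermediate results would shorten later reads of long chains.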
That is the Vault's machinery end to end: store almost everything as a delta against something else; to read an artifact back, walk its chain to a base and replay the deltas. The cost is a little work on every read; the prize is the 222×.
Leaving the Vault, you now know
- A blob row's content is always zlib-compressed, and usually a delta rather than the whole artifact: 88% of them.
- The delta table records which blob is a delta and against which source; sources chain.
- A delta is a small program (copy ranges from a source, insert literal bytes), and that is where the 222× compression comes from.
- content_get reconstructs any artifact: walk the delta chain up to a base, then apply deltas back down.
The Vault holds artifacts as a flat pool of immutable, hash-named content. District 3 asks the next question: how do those artifacts point at each other to form a history — check-ins, parents, branches?