When Your Identity Is Your Content

There’s a design pattern hiding in plain sight across half the systems you use daily. Git, Docker, IPFS, CDN asset pipelines, and LLM prompt caches all share the same fundamental architecture: they use the content itself as the identity. And they all run into the same consequence: when your name is derived from what you contain, you can’t change without becoming someone else.

The Pattern

In a content-addressed system, every object is identified by a hash of its content. You don’t choose the name — the content determines it. Two different machines, given the same bytes, will compute the same hash and arrive at the same identity.

This is elegant. It gives you deduplication for free — same content, same hash, store once. It gives you integrity verification — if the hash matches, the content is what it claims to be. And it gives you immutability by construction — you can’t modify an object in place because any modification produces a new hash, which means a new identity.

That last property is where things get interesting. Because it means you can never update anything. You can only create a replacement and point to it.
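
A minimal sketch of the idea in Python, assuming SHA-256 as the hash and an in-memory dict as the store. The `ContentStore` class and its method names are made up for illustration and are not taken from any of the systems discussed here:

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the SHA-256 of the value."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The caller never picks the name; the bytes determine it.
        key = hashlib.sha256(data).hexdigest()
        # Storing identical bytes twice is a no-op: deduplication for free.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Integrity check: recompute the hash and compare it to the name.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"object {key} is corrupted")
        return data

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
v1 = store.put(b"hello world")
same = store.put(b"hello world")   # same bytes, same identity
v2 = store.put(b"hello world!")    # an "edit" is a brand-new object
```

Note that there is no `update` method, and there is nothing sensible it could do: writing different bytes under an existing key would make the key wrong.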

Git: Every Byte Creates a Fork

Git is a content-addressed filesystem masquerading as a version control system¹. Every object — blob, tree, commit — is identified by a SHA-1 hash of its content. Changing one byte in one file produces a new blob hash, which cascades upward: the tree hash changes, the commit hash changes, and everything downstream points to different objects.
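
The blob case is small enough to reproduce by hand. Here is a sketch, assuming Git's default SHA-1 object format, in which a blob's ID is the hash of a short header followed by the file bytes. The function name `git_blob_hash` is mine:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way Git hashes a blob under its default SHA-1 format."""
    # Git prepends the header "blob <size>\0" to the content before hashing.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty = git_blob_hash(b"")
before = git_blob_hash(b"hello world\n")
after = git_blob_hash(b"hello world!\n")  # one extra byte, unrelated ID
```

An easy sanity check is to compare the result for a real file against `git hash-object <file>`.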

This is why you can’t amend a commit without rewriting history. git commit --amend doesn’t edit the commit in place; it writes a new commit with a new hash and moves the branch to it. git rebase does the same for a whole series of commits: it rebuilds the chain from the point of divergence. The old commits still exist (reachable through the reflog until garbage collection prunes them), and the new ones have completely different identities. You didn’t edit history. You created a parallel history and abandoned the original.

Git’s content-addressing also means the same repository, cloned to two machines, is verifiable by comparing a single hash. If the HEAD commit matches, every object reachable from it must be identical — because each hash commits to the hash of its children, all the way down. One hash at the root validates the entire tree. This is the property that makes Git’s distributed model work: no central authority is needed to verify integrity. The math handles it².
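
To see why one root hash is enough, here is a deliberately simplified model of the chain. It is not Git's real object format (it uses SHA-256, flat directories, and made-up serialization), but it keeps the one property that matters: each hash covers the hashes beneath it.

```python
import hashlib
from typing import Optional


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def tree_hash(files: dict[str, bytes]) -> str:
    """Hash a flat directory as sorted (name, blob hash) entries."""
    entries = [f"{name} {h(content)}" for name, content in sorted(files.items())]
    return h("\n".join(entries).encode())


def commit_hash(tree: str, parent: Optional[str], message: str) -> str:
    """A commit's identity covers its tree, its parent, and its message."""
    return h(f"tree {tree}\nparent {parent}\n\n{message}".encode())


def build_history(readme: bytes) -> str:
    """Build a two-commit history and return the hash of the latest commit."""
    c1 = commit_hash(tree_hash({"README": readme}), None, "initial")
    c2 = commit_hash(
        tree_hash({"README": readme, "main.py": b"print(1)\n"}), c1, "add main"
    )
    return c2


machine_a = build_history(b"docs\n")
machine_b = build_history(b"docs\n")   # independent rebuild, same bytes
tampered = build_history(b"docs!\n")   # one byte differs in the first commit
```

Comparing `machine_a` with `machine_b` checks every file in every commit at once, and the single changed byte in `tampered` surfaces at the head even though it was introduced in the first commit.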

Docker: Two Systems in a Trench Coat

Docker is a more complicated case. Image storage is genuinely content-addressed — every layer in a registry is identified by a sha256: digest of its content. You can reference ubuntu@sha256:abc123... and be guaranteed bit-for-bit identity regardless of what tag someone attaches to it³.

But Docker’s build cache — the mechanism that skips unchanged layers during docker build — is a hybrid. COPY and ADD instructions compute checksums from file content, which is true content-addressing. But RUN instructions match on the command string, not the output. RUN apt-get update will hit cache even if the upstream packages have changed entirely, because Docker only compares the string "apt-get update", not what it produces⁴.

This gap catches people constantly. The Dockerfile instruction looks like a function call, but the cache treats it like a name. Same name, different output — but the cache doesn’t know.

The cascading invalidation, though, is pure content-addressing logic: once any layer fails its cache check, every subsequent layer is rebuilt from scratch, even if those later instructions haven’t changed at all. One broken link in the chain breaks the whole chain.
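
Both behaviors fit in a toy model. The sketch below is not Docker's or BuildKit's actual algorithm; it only assumes what the text describes: COPY and ADD keys come from file bytes, other instructions are keyed by their text, and every key is chained to its parent. The helper names are hypothetical.

```python
import hashlib
from typing import Callable


def digest(*parts: str) -> str:
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()


def layer_keys(
    instructions: list[tuple[str, str]],
    read_file: Callable[[str], bytes],
) -> list[str]:
    """Return one cache key per instruction, each chained to its parent key."""
    keys: list[str] = []
    parent = "scratch"
    for op, arg in instructions:
        if op in ("COPY", "ADD"):
            # Content-addressed: the key depends on the bytes being copied.
            own = hashlib.sha256(read_file(arg)).hexdigest()
        else:
            # Name-addressed: only the instruction text is compared.
            own = arg
        parent = digest(parent, op, own)
        keys.append(parent)
    return keys


dockerfile = [
    ("RUN", "apt-get update"),
    ("COPY", "app.py"),
    ("RUN", "pip install -r requirements.txt"),
]
files_v1 = {"app.py": b"print('v1')\n"}
files_v2 = {"app.py": b"print('v2')\n"}

keys_v1 = layer_keys(dockerfile, files_v1.__getitem__)
keys_v2 = layer_keys(dockerfile, files_v2.__getitem__)
```

In this model the first RUN keeps its key no matter what the package mirror serves, the COPY key changes with the file, and the final RUN misses the cache even though its text is identical, because its parent changed.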

CDNs: Naming Things to Avoid Renaming Them

Phil Karlton reportedly said there are only two hard things in computer science: cache invalidation and naming things. CDN asset pipelines solved both at once — by making the name be the cache⁵.

When you run a modern JavaScript bundler, your output files look like app.a3f9d2c.js — the content hash baked into the filename. Change the code, get a new hash, get a new filename. The CDN never needs to invalidate anything because the URL itself has changed. You set Cache-Control: max-age=31536000, immutable and walk away. One year. No invalidation needed. The old URL serves the old content forever (or until it expires), and the new URL serves the new content. They coexist rather than collide⁶.

This is content-addressing applied to the naming layer. The build tool creates the name from the content, and the CDN caches by name, which means it’s implicitly caching by content. The cleverness is that the CDN doesn’t need to know this — it just sees unique URLs. The semantic relationship between content and identity is enforced by the build step, not the infrastructure.
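
The build step itself is only a few lines. This is a sketch of what a bundler does conceptually, assuming SHA-256 and an 8-character fingerprint; real tools pick their own hash function and length, and `hashed_name` is a name I made up:

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a truncated content hash before the extension: app.js -> app.<hash>.js"""
    path = PurePosixPath(filename)
    fingerprint = hashlib.sha256(content).hexdigest()[:length]
    return f"{path.stem}.{fingerprint}{path.suffix}"


# Safe only because the URL changes whenever the content does.
IMMUTABLE_HEADERS = {"Cache-Control": "max-age=31536000, immutable"}

old = hashed_name("app.js", b"console.log('v1');")
new = hashed_name("app.js", b"console.log('v2');")
```

The HTML page that references `old` or `new` is the part that must stay revalidatable, since it is the only place the current hash is recorded.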

IPFS: The Purest Form

IPFS takes content-addressing to its logical extreme. Every file is identified by a CID — a Content Identifier derived from a hash of the data⁷. There are no URLs, no server locations, no authority that controls what a name means. The address is the content, mathematically.

This makes IPFS inherently immutable. You can’t “update” a file at a CID — that’s meaningless. You can create a new version with new content, which gets a new CID, which is a new address. The old version still exists at its old address for as long as any node pins it.
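
For a feel of what a CID contains, here is a sketch that builds a CIDv1 string for the simplest case: a single block stored with the raw codec and hashed with SHA-256. That case is an assumption on my part; IPFS implementations split larger files into chunks and wrap them in DAG nodes, so this will not reproduce the CID of an arbitrary file.

```python
import base64
import hashlib


def raw_cid_v1(data: bytes) -> str:
    """Build a CIDv1 string for a single raw block hashed with SHA-256."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = "raw" codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes).
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase prefix "b" means lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


cid_v1 = raw_cid_v1(b"hello ipfs\n")
cid_v2 = raw_cid_v1(b"hello ipfs!\n")  # the "updated" file is a different address
```

Everything after the four-byte prefix is the hash of the data, which is why there is no way to keep a CID while changing what it refers to.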

IPFS solves mutability the same way everyone else does: with a pointer layer. IPNS (InterPlanetary Name System) lets you publish a mutable record — signed with your private key — that points to the current CID. The pointer has a stable identity (derived from your public key). The content it points to changes⁸. The IPFS documentation makes the analogy explicit: CIDs are like git commit hashes, IPNS names are like git tags.

LLM Prompt Caches: Content-Addressing in Conversation

The example that made me notice the pattern is the most recent. Anthropic’s prompt caching for Claude computes a hash of the request prefix (tool definitions, system prompt, and messages, in that order) up to a marked breakpoint⁹. If a subsequent request produces the same prefix hash, the cached portion isn’t re-processed. Cache reads cost 10% of the normal input token price. The economics are dramatic.

But the cache key is a cumulative hash of content. Change any block at or before the breakpoint — add a system instruction, modify a tool description, insert a message — and the hash changes. The entire cached prefix is invalidated. One byte anywhere in the prefix means the system re-reads everything from scratch.

The documentation is explicit: “100% identical prompt segments” are required for a cache hit. This isn’t fuzzy or approximate matching; it’s content-addressing applied to a prefix. The cache key literally is a hash of the content.
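
I don't know how the cache key is computed internally, so the following is only a toy model of a cumulative prefix hash. The block shapes are invented for the example and are not the API's request format; the point is the behavior, not the mechanism.

```python
import hashlib
import json


def prefix_keys(blocks: list[dict]) -> list[str]:
    """Return the cumulative hash of the prefix after each block."""
    running = hashlib.sha256()
    keys = []
    for block in blocks:
        # Canonical JSON so key order inside a block does not matter.
        running.update(json.dumps(block, sort_keys=True).encode())
        keys.append(running.copy().hexdigest())
    return keys


# Hypothetical block layout, for illustration only.
base = [
    {"kind": "tool", "name": "search", "description": "Search the docs."},
    {"kind": "system", "text": "You are a helpful assistant."},
    {"kind": "user", "text": "Summarize chapter 1."},
]
# Same conversation with a reworded tool description at the very start.
edited = [dict(base[0], description="Search the documentation.")] + base[1:]
# Same conversation extended with one more message at the end.
extended = base + [{"kind": "user", "text": "Now chapter 2."}]

keys_base = prefix_keys(base)
keys_edited = prefix_keys(edited)
keys_extended = prefix_keys(extended)
```

Appending leaves every earlier key intact, so the old prefix can still be found. Editing the first block changes every key after it, even though the later blocks are byte-for-byte the same.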

In practice, this creates an invisible constraint on how you design conversations with the API. Every decision about message order, tool configuration, and system prompt wording is frozen once caching kicks in. Refactoring your tool descriptions? Cache break. Reordering messages after a context window trim? Cache break. Adding logging metadata to system prompts between sessions? Cache break. The content is the identity, and identity is not negotiable.

The Universal Interface

Here’s what I find striking: every one of these systems independently converged on the same two-layer solution.

Layer 1: Immutable content-addressed store. Objects are identified by their content hash. They cannot be modified in place. Same content always produces the same identity.

Layer 2: Mutable pointer. A stable name — a branch, a tag, an IPNS record, a URL, a Docker tag — that can be updated to point at different content over time.

Git branches are mutable pointers to immutable commit hashes. Docker tags are mutable pointers to immutable image digests. IPNS records are mutable pointers to immutable CIDs. CDN HTML pages are mutable documents pointing to immutable hashed asset URLs. Even LLM prompt cache systems silently manage this: your conversation grows (the pointer advances), but each cached prefix snapshot is immutable, identified by its hash.
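
Stripped of specifics, the two layers look something like this. It is a sketch with invented names (`Repo`, `publish`, `resolve`), not the API of any system above:

```python
import hashlib


class Repo:
    """Toy two-layer store: immutable objects plus mutable named refs."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}   # layer 1: hash -> content
        self.refs: dict[str, str] = {}        # layer 2: name -> hash

    def publish(self, ref: str, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(key, content)  # never overwritten
        self.refs[ref] = key                   # the only thing that mutates
        return key

    def resolve(self, ref: str) -> bytes:
        return self.objects[self.refs[ref]]


repo = Repo()
first = repo.publish("latest", b"release 1")
second = repo.publish("latest", b"release 2")
```

After the second call, `latest` resolves to the new release while the first one is still retrievable by its hash. All the mutability lives in one small dictionary, which is also the only part that needs trust or coordination.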

The Nix package manager¹⁰ might be the most aggressive expression of this pattern: every package is stored at /nix/store/<hash>-<name>-<version>/, where the hash includes the entire dependency tree. Install the same package with the same dependencies and you get the same path. Change one dependency version, anywhere in the tree, and you get an entirely new store path. The same program, with the same source code, compiled with a slightly different library, has a different identity. Because the identity includes everything it depends on.
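
A rough sketch of that idea, with the caveat that this is not Nix's real hashing scheme (which works on derivations and uses its own encoding). The package names, versions, and the `store_path` function are hypothetical:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    source: bytes
    deps: tuple["Package", ...] = ()


def store_path(pkg: Package) -> str:
    """Derive a path whose hash covers the package and all of its dependencies."""
    h = hashlib.sha256()
    h.update(f"{pkg.name}\0{pkg.version}\0".encode())
    h.update(hashlib.sha256(pkg.source).digest())
    for dep in sorted(pkg.deps, key=lambda d: d.name):
        # Each dependency contributes its own full store path, recursively.
        h.update(store_path(dep).encode())
    return f"/nix/store/{h.hexdigest()[:32]}-{pkg.name}-{pkg.version}"


# Made-up packages: identical curl source, built against two zlib versions.
zlib_old = Package("zlib", "1.3", b"zlib source")
zlib_new = Package("zlib", "1.3.1", b"zlib source")
curl_a = Package("curl", "8.0", b"curl source", (zlib_old,))
curl_b = Package("curl", "8.0", b"curl source", (zlib_new,))
```

`store_path(curl_a)` and `store_path(curl_b)` differ even though the curl source and version are identical, because the dependency's path is part of what gets hashed.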

What You Gain, What You Lose

Content-addressing buys you three things that are expensive to get any other way:

Integrity without authority. Anyone can verify an object by recomputing its hash. No certificate authority, no central server, no trust hierarchy required. The verification is in the math.

Deduplication without coordination. Two systems that independently create the same content arrive at the same identity. They can discover they have the same object without communicating beforehand.

Immutability without enforcement. You don’t need access controls to prevent modification — modification is structurally impossible. There’s nothing to modify. Any change creates a new object.

What you lose is the intuition that things can change in place. In most human mental models, an object has a stable identity and a mutable state. My car is still my car after an oil change. A document is still the same document after a revision. Content-addressing breaks this. After the revision, it’s a different document — literally, mathematically, a different object with a different address. The old document still exists, unchanged, at its old address. You haven’t edited anything. You’ve forked reality.

For systems designed around this model, that’s fine: it’s the point. For humans interacting with those systems, it’s a perpetual source of confusion. Why does amending a pushed commit require a force-push? Because the amended commit is a new object; the old history, identified by its content, can never become the new history. Why does changing a Dockerfile line rebuild everything after it? Because each layer’s identity includes its parent’s identity. Why does my prompt cache break when I add a debugging header to the system message? Because the cache key is a hash of the content, and the content changed.

The frozen-decision problem isn’t a bug in any of these systems. It’s the cost of the guarantee they provide. Identity derived from content is incorruptible — but it’s also unforgiving. Every change is a new beginning. Nothing is revised; everything is replaced.

You can’t update who you are. You can only become someone new and hope the pointers follow.

¹ Linus Torvalds’ original description called Git a “content-addressable filesystem” before it was a VCS. See Git Internals - Git Objects in the official Git documentation.
² Git’s SHA-1 has been hardened since Git v2.13.0 (2017) against the SHAttered collision attack. The SHA-256 migration is documented but SHA-1 remains the default as of Git 2.54 (2026).
³ Docker images use OCI content-addressable storage with sha256: digests. See the OCI Image Specification for the formal definition.
⁴ Docker build cache invalidation rules: COPY/ADD instructions check file content checksums, but RUN instructions match only the command string. See Docker Build Cache Invalidation for the official behavior.
⁵ Google Cloud CDN documentation explicitly recommends appending content hashes to URLs as a cache-busting strategy that “aligns with modern web development workflows.”
⁶ The immutable Cache-Control directive was formalized in RFC 8246 (2017) specifically for content-addressed URLs where the resource will never change at that address.
⁷ IPFS official documentation on Content Addressing describes CIDs and the guarantee that “any difference in the content will produce a different CID.”
⁸ IPFS documentation on IPNS explains the mutable pointer layer and draws the explicit analogy: “CIDs are like commit hashes in Git” while “IPNS names are like tags in git.”
⁹ Anthropic’s Prompt Caching documentation specifies that cache hits require “100% identical prompt segments” and that the system writes “a hash of the prefix” as the cache key.
¹⁰ The Nix package manager stores derivations at content-addressed paths. See How Nix Works for the hash-based store path architecture.