Except no one wants content-addressed data - because if you knew what it was you wanted, then you would already have stored it. The web as we know it is an index - it's a way to discover that data is available, and specifically we usually want the latest data that's available.

AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.




Yes, for the reasons you describe, you can't be both a useful web-like protocol and also 100% immutable/hash-linked.

But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some JavaScript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.

If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:

- by moving kB, not MB; we're just talking about a tuple of hashes here, maybe plus a public key and a signature

- without placing additional burden on whoever authored the first thing; they don't even have to be the ones who published the pair of hashes that your scraper is interested in

Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
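To make the "tuple of hashes" idea concrete, here's a minimal sketch in Python. It assumes SHA-256 content addresses and Ed25519 signatures via the `cryptography` package; the record layout and field names are made up for illustration, not taken from any existing protocol:

    # A kilobyte-scale "association record": two content hashes, the
    # publisher's public key, and a signature over the pair.
    import hashlib
    import json

    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric import ed25519

    def content_hash(data: bytes) -> str:
        """Content address: hex SHA-256 of the raw bytes."""
        return hashlib.sha256(data).hexdigest()

    def publish_link(hash_a: str, hash_b: str, key: ed25519.Ed25519PrivateKey) -> dict:
        """Sign the claim 'hash_a is now associated with hash_b'."""
        message = f"{hash_a} {hash_b}".encode()
        public = key.public_key().public_bytes(
            encoding=serialization.Encoding.Raw,
            format=serialization.PublicFormat.Raw,
        )
        return {
            "from": hash_a,
            "to": hash_b,
            "publisher": public.hex(),
            "sig": key.sign(message).hex(),
        }

    key = ed25519.Ed25519PrivateKey.generate()
    old = content_hash(b"yesterday's article")
    new = content_hash(b"today's follow-up")
    record = publish_link(old, new, key)
    print(len(json.dumps(record)), "bytes")  # a few hundred bytes, not megabytes

Note that the publisher doesn't have to be whoever authored either piece of content; anyone holding the two hashes can sign such a claim.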


But we already have HEAD requests and ETags.

It is entirely possible to serve a fully cached response that says "you already have this". The problem is...people don't implement this well.
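For example, a conditional GET costs one round trip and no body transfer when nothing has changed. A quick sketch using the `requests` package and a placeholder URL:

    import requests

    URL = "https://example.com/feed"  # placeholder

    first = requests.get(URL)
    etag = first.headers.get("ETag")

    # Second visit: send the ETag back; a well-behaved server answers
    # 304 Not Modified with an empty body if nothing changed.
    second = requests.get(URL, headers={"If-None-Match": etag} if etag else {})
    if second.status_code == 304:
        print("Not modified: reuse the cached copy")
    else:
        print("Re-downloaded", len(second.content), "bytes")

But plenty of servers either omit the ETag or never answer 304, so clients end up re-fetching everything anyway.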


People don't implement them well because they're overburdened by all of the different expectations we put on them. It's a problem with how DNS forces us to allocate expertise: as it is, you need some kind of write access to the server whose name shows up in the URL if you want to contribute to it. That's how globally unique names create fragility.

If content were handled independently of server names, anyone who wanted to distribute metadata for content they care about could do so. They wouldn't need write access, or even to be on the same network partition. You could just publish a link between content A and content B because you know their hashes. Assembling all of this can happen in the browser, subject to the user's configuration about whom they trust.
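Sketching the client side of that, using the same hypothetical record fields as in the earlier sketch: the browser keeps a set of trusted publisher keys and only assembles links whose signatures check out.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric import ed25519

    def accept_link(record: dict, trusted_publishers: set[str]) -> bool:
        """Accept a hash-to-hash claim only from a trusted, verified publisher."""
        if record["publisher"] not in trusted_publishers:
            return False
        public = ed25519.Ed25519PublicKey.from_public_bytes(
            bytes.fromhex(record["publisher"])
        )
        message = f"{record['from']} {record['to']}".encode()
        try:
            public.verify(bytes.fromhex(record["sig"]), message)
            return True
        except InvalidSignature:
            return False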


> because if you knew what it was you wanted, then you would already have stored it.

"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.


But when you do a "git pull" you're not pulling from someplace identified by a hash, but rather from a hostname. The learning-about-new-hashes part has to be handled differently.

It's a legit limitation on what content addressing can do, but it's one we can overcome by just not having everything be content addressed. The web we have now is like if you did a `git pull` every time you opened a file.

The web I'm proposing is like how we actually use git--periodically pulling new hashes as a separate action, but spending most of our time browsing content that we already have hashes for.
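If it helps, here's a toy version of that split in Python; every name here is hypothetical. Content lives in an immutable, hash-keyed store, and the only thing that needs regular network access is a small mutable index of pointers:

    import hashlib

    store: dict[str, bytes] = {}   # content hash -> immutable bytes
    index: dict[str, str] = {}     # mutable pointers, e.g. "blog/latest" -> hash

    def put(content: bytes) -> str:
        h = hashlib.sha256(content).hexdigest()
        store[h] = content         # write-once: this hash always means these bytes
        return h

    def pull_index(remote_index: dict[str, str]) -> None:
        """Like `git fetch`: the small, periodic step that learns new hashes."""
        index.update(remote_index)

    def read(name: str) -> bytes:
        """Browsing: resolve a pointer locally, then read by hash."""
        return store[index[name]]

    latest = put(b"today's post")
    pull_index({"blog/latest": latest})
    print(read("blog/latest"))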



