The broken thing about the web is that in order for data to remain readable, a unique sysadmin somewhere has to keep a server running in the face of an increasingly hostile environment.
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
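To make the gossip point concrete, here's a minimal sketch (Python, with a made-up `peer.get` interface) of why it's safe to accept content-addressed data from anyone, scraper included: whoever serves the bytes, the hash check is the same.

```python
import hashlib

def verify_blob(expected_hash: str, blob: bytes) -> bool:
    # The address is a hash of the bytes, so any copy can be checked locally.
    return hashlib.sha256(blob).hexdigest() == expected_hash

def fetch(content_hash: str, peers) -> bytes:
    # Try peers in any order: a scraper's cache is as good as the origin server.
    for peer in peers:
        blob = peer.get(content_hash)  # hypothetical peer interface
        if blob is not None and verify_blob(content_hash, blob):
            return blob
    raise LookupError(f"no peer could serve {content_hash}")
```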
Assuming the right incentives can be found to prevent widespread leeching, a distributed content-addressed model indeed solves this problem, but introduces the problem of how to control your own content over time. How do you get rid of a piece of content? How do you modify the content at a given URL?
I know, as far as possible it's a good idea to have content-immutable URLs. But at some point, I need to make www.myexamplebusiness.com show new content. How would that work?
As for how to get rid of a piece of content... I think that one's a lost cause. If the goal is to prevent things that make content unavailable (e.g. AI scrapers overwhelming servers), then you end up with a design that also prevents things that make content unavailable on purpose (e.g. legitimate deletions). The whole point is that you're not the only one participating in propagating the content, and that comes with trade-offs.
But as for updating, you just format your URLs like so: {my-public-key}/foo/bar
And then you alter the protocol so that the {my-public-key} part resolves to the merkle-root of whatever you most recently published. So people who are interested in your latest content end up with a whole new set of hashes whenever you make an update. In this way, it's not 100% immutable, but the mutable payload stays small (it's just a bunch of hashes) and since it can be verified (presumably there's a signature somewhere) it can be gossiped around and remain available even if your device is not.
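Something like this is what I have in mind for the mutable part, sketched here with ed25519 keys from the `cryptography` package (the record fields are just illustrative): a tiny signed tuple mapping a public key to the latest merkle root, which any node can verify and gossip without contacting me.

```python
import json
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey)

def publish_pointer(key: Ed25519PrivateKey, merkle_root: str, seq: int) -> dict:
    # The whole mutable payload: the latest root plus a sequence number.
    payload = json.dumps({"root": merkle_root, "seq": seq}).encode()
    pub = key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)
    return {"pubkey": pub.hex(), "payload": payload.decode(),
            "sig": key.sign(payload).hex()}

def verify_pointer(record: dict) -> dict:
    # Anyone holding the record can check it; raises InvalidSignature if forged.
    pub = Ed25519PublicKey.from_public_bytes(bytes.fromhex(record["pubkey"]))
    pub.verify(bytes.fromhex(record["sig"]), record["payload"].encode())
    return json.loads(record["payload"])

# Nodes keep whichever verified record has the highest seq for a given pubkey,
# so {my-public-key}/foo/bar can resolve even when my device is offline.
```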
You can soft-delete something just by updating whatever pointed to it so it doesn't point to it anymore. Eventually most nodes will forget it. You can't really prevent a node from hanging on to an old copy if it wants to, but then again, could you ever? Deleting something on the web has always been a bit of a fiction.
True in the absolute sense, but the effect size is much worse under the kind of content-addressable model you're proposing. Currently, if I download something from you and you later delete that thing, I can still keep my downloaded copy; under your model, if anyone ever downloads that thing from you and you later delete that thing, with high probability I can still acquire it at any later point.
As you say, this is by design, and there are cases where this design makes sense. I think it mostly doesn't for what we currently use the web for.
You could only later get the thing if you grabbed its hash while it was still available. And you could only reliably resolve that hash later if somebody (maybe you) went out of their way to pin the underlying data. Otherwise nodes would forget rather quickly, because why bother keeping around unreferenced bits?
It's the same functionality you get with permalinks and sites like archive.org--forgotten unless explicitly remembered by somebody, dynamic unless explicitly a permalink. It's just built into the protocol rather than a feature to be inconsistently implemented over and over by many separate parties.
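A toy version of that node behavior, just to pin down what "forgotten unless explicitly remembered" could mean (the class and the eviction policy are mine, not any existing protocol's):

```python
import hashlib, time

class BlobCache:
    """Toy node-side store: content sticks around only while referenced or pinned."""
    def __init__(self):
        self.blobs = {}     # hash -> (bytes, last_referenced_at)
        self.pins = set()   # hashes someone explicitly asked this node to keep

    def put(self, blob: bytes) -> str:
        h = hashlib.sha256(blob).hexdigest()
        self.blobs[h] = (blob, time.time())
        return h

    def pin(self, content_hash: str):
        # The permalink / archive.org case: remember even if nothing points here.
        self.pins.add(content_hash)

    def sweep(self, ttl: float = 7 * 24 * 3600):
        # A soft-deleted blob is just one nobody references; forget it after a while.
        now = time.time()
        for h, (_, last_seen) in list(self.blobs.items()):
            if h not in self.pins and now - last_seen > ttl:
                del self.blobs[h]
```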
Except no one wants content addressed data - because if you knew what it was you wanted, then you would already have stored it. The web as we know it is an index - it's a way to discover that data is available and specifically we usually want the latest data that's available.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
Yes, for the reasons you describe, you can't be both a useful web-like protocol and also 100% immutable/hash-linked.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some javascript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving kB, not MB--we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing--they don't even have to be the ones who published the pair of hashes that your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
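As a rough sketch of how small that second request could be (the names and wire format are invented here, with the `cryptography` package supplying the signature): an association is just a signed tuple of hashes, a couple hundred bytes that anyone, not only the original author, can publish and serve.

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def make_association(signer: Ed25519PrivateKey, known_hash: str,
                     new_hash: str, rel: str = "newer-version"):
    # A few hundred bytes total: two hashes, a relation, and a signature.
    body = json.dumps({"from": known_hash, "to": new_hash, "rel": rel}).encode()
    return body, signer.sign(body)

def discover(known_hashes: set, associations) -> set:
    # What a scraper actually wants: hashes it didn't know existed yet.
    new = set()
    for body, _sig in associations:  # assume signatures are checked elsewhere
        record = json.loads(body)
        if record["from"] in known_hashes:
            new.add(record["to"])
    return new
```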
People don't implement them well because they're overburdened by all of the different expectations we put on them. It's a problem with how DNS forces us to allocate expertise. As it is, you need some kind of write access on the server whose name shows up in the URL if you want to contribute to it. This is how globally unique names create fragility.
If content were handled independently of server names, anyone who cares to distribute metadata for content they care about can do so. One doesn't need write access, or even to be on the same network partition. You could just publish a link between content A and content B because you know their hashes. Assembling all of this can happen in the browser, subject to the user's configs re: who they trust.
> because if you knew what it was you wanted, then you would already have stored it.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
But when you do a "git pull" you're not pulling from someplace identified by a hash, but rather a hostname. The learning-about-new-hashes part has to be handled differently.
It's a legit limitation on what content addressing can do, but it's one we can overcome by just not having everything be content addressed. The web we have now is like if you did a `git pull` every time you opened a file.
The web I'm proposing is like how we actually use git--periodically pulling new hashes as a separate action, but spending most of our time browsing content that we already have hashes for.
Can you point me at what you mean? I'm not immediately finding something that indicates that it is not fit for this use case. The fact that bad actors use it to resist those who want to shut them down is, if anything, an endorsement of its durability. There's a bit of overlap between resisting the AI scrapers and resisting the FBI. You can either have a single point of control and a single point of failure, or you can have neither. If you're after something that's both reliable and reliably censorable--I don't think that's in the cards.
That's not to say that it is a ready replacement for the web as we know it. If you have hash-linked everything then you wind up with problems trying to link things together, for instance. Once two pages exist, you can't after-the-fact create a link between them, because if you update them to contain that link then their hashes change, so now you have to propagate the new hashes to people. This makes it difficult to do things like have a comments section at the bottom of a blog post. So you've got to handle metadata like that in some kind of extra layer--a layer which isn't hash-linked and which might be susceptible to all the same problems that our current web is--and then the browser can build the page from immutable pieces, but the assembly itself ends up being dynamic (and likely sensitive to the user's preferences, e.g. dark mode as a browser thing, not a page thing).
But I still think you could move maybe 95% of the data into an immutable hash-linked world (think of these as nodes in a graph), the remaining 5% just being tuples of hashes and public keys indicating which pages are trusted by which users, which ought to be linked to which others, which are known to be the inputs and outputs of various functions, and you know... structure stuff (these are our graph's edges).
The edges, being smaller, might be subject to different constraints than the web as we know it. I wouldn't propose that we go all the way to a blockchain where every device caches every edge, but it might be feasible for my devices to store all of the edges for the 5% of the web I care about, and your devices to store the edges for the 5% that you care about... the nodes only being summoned when we actually want to view them. The edges can be updated when our devices contact other devices (based on trust, like you know that device's owner personally) and ask "hey, what's new?"
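A sketch of what those edges might look like in practice (the field names, relations, and the "what's new" question are all illustrative, not a real protocol): tiny signed tuples that my devices sync from devices I trust, with pages assembled client-side from immutable nodes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str        # hash of one immutable node (e.g. the blog post)
    rel: str        # "comment", "stylesheet", "newer-version", ...
    dst: str        # hash of another immutable node (e.g. a comment)
    author: str     # public key of whoever asserted the link
    sig: str        # signature over (src, rel, dst)
    seen_at: float  # when this device first learned of the edge

class EdgeStore:
    def __init__(self):
        self.edges: list[Edge] = []

    def whats_new(self, since: float) -> list[Edge]:
        # What a trusted device asks when it dials in: "hey, what's new?"
        return [e for e in self.edges if e.seen_at > since]

    def assemble_page(self, post_hash: str, fetch) -> dict:
        # Build the page client-side: immutable post plus whatever comment
        # edges this device happens to hold.
        comments = [fetch(e.dst) for e in self.edges
                    if e.src == post_hash and e.rel == "comment"]
        return {"post": fetch(post_hash), "comments": comments}
```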
I've sort of been freestyling on this idea in isolation; probably there are already some projects that scratch this itch. A while back I made a note to check out https://ceramic.network/ in this capacity, but I haven't gotten down to trying it out yet.
I figure we'd create that incentive by configuring our devices to only talk to devices controlled by people we trust. If they want the data at all, they have to gain our trust, and if they want that, they have to seed the data. Or you know, whatever else the agreement ends up being. Maybe we make them pay us.