
We all agree that AI crawlers are a big issue as they don't respect any established best practices, but we rarely talk about the path forward. Scraping has been around for as long as the internet, and it was mostly fine. There are many very legitimate use cases for browser automation and data extraction (I work in this space).

So what are potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].

How can we enable beneficial automation while protecting against abusive AI crawlers?

[0] https://arxiv.org/abs/2311.10911




Proof-of-work works in terms of preventing large-scale automation.

As for letting well-behaved crawlers in, I've had an idea for something like DKIM for crawlers. It should be possible to set up a fairly cheap cryptographic scheme that gives crawlers a persistent identity that can't be forged.

Basically, put a header containing a string with today's date, the crawler's IP, and a ___domain name, followed by a cryptographic signature of that string. The ___domain has a TXT record with a public key for verifying the identity. It's cheap because the server really only needs to verify the string once, and the crawler only needs to regenerate it once per day.
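
A minimal sketch of what that could look like, assuming Ed25519 keys, a made-up header format, and a made-up "_crawler" TXT record layout (none of this is an existing standard), in Python with the cryptography and dnspython packages:

    # Hypothetical crawler-identity scheme: token = date|ip|___domain, signed with
    # a key whose public half is published in a DNS TXT record for that ___domain.
    import base64
    import datetime

    import dns.resolver  # pip install dnspython
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
        Ed25519PublicKey,
    )

    def make_header(private_key: Ed25519PrivateKey, ip: str, ___domain: str) -> str:
        """Crawler side: regenerate once per day, send with every request."""
        token = f"{datetime.date.today().isoformat()}|{ip}|{___domain}".encode()
        sig = private_key.sign(token)
        return base64.b64encode(token).decode() + "." + base64.b64encode(sig).decode()

    def verify_header(header: str, client_ip: str) -> bool:
        """Server side: check date and IP, then the signature against the ___domain's TXT key."""
        token_b64, sig_b64 = header.split(".")
        token = base64.b64decode(token_b64)
        date_str, ip, ___domain = token.decode().split("|")
        if date_str != datetime.date.today().isoformat() or ip != client_ip:
            return False
        # Assumed TXT layout: "_crawler.<___domain>  TXT  crawler-key=<base64 public key>"
        for record in dns.resolver.resolve(f"_crawler.{___domain}", "TXT"):
            txt = b"".join(record.strings).decode()
            if not txt.startswith("crawler-key="):
                continue
            public_key = Ed25519PublicKey.from_public_bytes(
                base64.b64decode(txt.split("=", 1)[1]))
            try:
                public_key.verify(base64.b64decode(sig_b64), token)
                return True
            except InvalidSignature:
                continue
        return False

The server can cache a header it has already verified for the rest of the day, so the signature check really is a once-per-crawler-per-day cost.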

With that in place, crawlers can crawl with their reputation at stake. The big problem with these rogue scrapers is that they're basically impossible to identify or block, which means they have no incentive to behave well.


> Proof-of-work works in terms of preventing large-scale automation.

It wouldn't work to prevent the type of behavior shown in the title story.


My pet peeve is that using the term "AI crawler" for this conflates things unnecessarily. There are some people who are angry at it due to anti-AI bias and not wishing to share information, while others are more concerned about the large amount of bandwidth and the server overload.

Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.

It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". Then from there, it is easier to come to a solution than just "I hate AI". For example, one would realize that things like Anubis have existed forever; they're just called DDoS protection, specifically the kind using proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).

This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.


It's become unbearable in the "AI era", so it's appropriate to blame AI for it, in my eyes. Especially since so much of the defense is based around training LLMs.

It's just like how not all DDoSes are actually hackers or bots. Sometimes a server just can't take the traffic of a large site flooding in. But the result is the same until something is investigated.


It's not a coincidence that this wasn't a major problem until everybody and their dog started trying to build the next great LLM.


Blame the "AI" companies for that. I am glad the small web is pushing hard against these scrapers, with the rise of Anubis as a starting point


> Blame the "AI" companies for that. I am glad the small web is pushing hard towards these scrapers, with the rise of Anubis as a starting point

Did you mean "against"?


Corrected, thanks


The best solution I've seen is to hit everyone with a proof of work wall and whitelist the scrapers that are welcome (search engines and such).

Running SHA hash calculations for a second or so once a week is not bad for users, but scrapers constantly starting new sessions end up spending most of their time running useless JavaScript, slowing them down significantly.
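
For reference, the core of such a proof-of-work wall is tiny. Here's a toy sketch (the difficulty, token handling, and cookie lifetime are assumptions for illustration, not how Anubis or PoW-Shield actually implement it):

    # Toy SHA-256 proof of work: the client must find a nonce such that
    # sha256(challenge + nonce) starts with DIFFICULTY zero hex digits.
    import hashlib
    import itertools
    import os

    DIFFICULTY = 5  # leading zero hex digits; ~16^5 hashes on average per solve

    def issue_challenge() -> str:
        # Server side: random challenge tied to the session/cookie.
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        # Client side (normally JavaScript in the browser): brute-force a nonce.
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith("0" * DIFFICULTY):
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # Server side: one hash to verify, then hand out a cookie valid for ~a week.
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

The asymmetry is the point: a real user pays the cost once per cookie lifetime, while a scraper that discards cookies pays it on every new session.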

The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.


The issue is that it would require normal users to do the same, which is suboptimal from a privacy point of view.


I wrote an article about a possible proof of personhood solution idea: https://mjaseem.github.io/tech/2025/04/12/proof-of-humanity.....

The broad idea is to use zero knowledge proofs with certification. It sort of flips the public key certification system and adds some privacy.

For this to get into place, though, the powers in charge would need to be swayed.


> So what are potential solutions?

It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.

Or at least better regulate it while also levying punitive damages that are significant enough to both dissuade bad-actors and encourage entities to view data-breaches (or the potential therein) and "leakage[0]" as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (e.g., how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes who are incentivized by the ad-economy? etc.).

Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data-privacy or personal-ownership of their data, if they ever even cared in the first place. So, in the absence of willingness to do something about the incentive for this malign behavior, we're left with few good options.

0: https://news.ycombinator.com/item?id=43716704 (see comments on all the various ways people's data is being leaked/leached/tracked/etc)


CAPTCHAs are also quickly becoming irrelevant / not enough. Fingerprint-based approaches seem to be the only realistic way forward in the cat-and-mouse game.


I hate this but I suspect a login-only deanonymised web (made simple with chrome and WEI!) is the future. Firefox users can go to hell.


I'm still surprised by people every day, after all these years. This is one of those times. Crazy how anyone would ever want a single point identifying everything they do.


I don't want this - It's the exact opposite of what I want.


We won't.


To elaborate (if anyone sees this) I use Firefox on Linux. I don't LIKE this future! I just think it's where the web is headed.


But people don’t interact with your website anymore; they ask an AI. So the AI crawler is a real user.

I say we ask Google Analytics to count an AI crawler as a real view. Let’s see who’s most popular.



