Go ahead and block AI web crawlers (coryd.dev)
26 points by cdme on March 2, 2024 | 22 comments



One advantage Google has here over everyone else is they can tie it to search crawling and say “sure, you can block Gemini but only by blocking GoogleBot”. That’s a tradeoff very few would make, whereas there’s not much to lose by blocking OpenAI. This gives Google a massive advantage over competitors in terms of training data.


How does one differentiate between "AI web crawlers" and "non-AI web crawlers"?

What faith can be placed in User-Agent strings? The contents of this header have been faked since the birth of the www in the early 90s.
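
For illustration, a minimal Python sketch of how trivial spoofing is; the URL is a placeholder and the header value is simply whatever the sender chooses to type:

    # Any HTTP client can claim to be any crawler; the server only
    # sees the string the sender put in the header.
    import urllib.request

    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()  # served as though the request came from Googlebot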

How does anyone know what someone will do with the data they have crawled? There are no transparency requirements, there are no legally-enforceable agreements. There are only IP addresses and User-Agent headers.

No one can look at a User-Agent header or an IP address and conclude, "I know what someone will do with the data I let them access, because this header or address confirms it". Those accessing the data could use it for any purpose or transfer it to someone else to use for any purpose. Unless the website operator has a binding agreement with the organisation doing the crawling, any assumptions about future behaviour made on the basis of a header or IP address offer no control whatsoever.

Perhaps an IP address could be used to conclude the HTTP requests are being sent by Company X, but the address does not indicate what Company X will do with the data it collects. Company X can do whatever it wants. It does not need to tell anyone what it is doing; it could make up a story about what it is doing that conveniently conceals facts.

These so-called "tech" companies that are training "AI" are secretive and non-transparent. Further, they will lie when it suits their interests. They do not ask for permission, they only ask for forgiveness. They are strategic and unfortunately deceptive in what they tell the public. "Trust" at your own risk.

Although it may be useless as a means of controlling how crawling data is used, it still makes sense to me to put something in robots.txt to indicate there is no consent given for crawling for the purpose of training "AI". Better would be to publish some explicit notice to the public that no consent is given to use data from the website to train "AI".
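
As a sketch, such a robots.txt entry might target the training-crawler tokens the major vendors have published (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl); whether anyone honors it is, of course, the whole question:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /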

Put the restrictions in a license. Let the so-called "tech" companies assent to that license. Then, when the evidence of unauthorised data use becomes available, enforce the license.


Block any crawler that doesn't come from known ranges that you whitelist.
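
A minimal sketch of that check in Python; the CIDR below is illustrative (vendors publish and rotate their ranges, and a real setup would also verify reverse DNS):

    # Allowlist check: reject crawler traffic from outside known ranges.
    # The range below is an example of a published Googlebot block; load
    # the current lists from each vendor rather than hardcoding them.
    from ipaddress import ip_address, ip_network

    ALLOWED_RANGES = [ip_network("66.249.64.0/19")]

    def is_allowed(client_ip: str) -> bool:
        addr = ip_address(client_ip)
        return any(addr in net for net in ALLOWED_RANGES)

    print(is_allowed("66.249.66.1"))  # True: inside the allowlisted range
    print(is_allowed("203.0.113.9"))  # False: unknown source, block it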


A good example of why this won’t work, one that’s 10x more extreme and has legal consequences… in theory…

Didn’t the FBI admit that it just bought U.S. citizen data instead of spying on them? Almost two years in a row?

This has been happening non-stop for over 20 years without any repercussions.

I don’t think robots.txt will make a difference. If your site is internet-facing, nothing will prevent it from being crawled and scraped.

If anything, I can almost guarantee AI crawling will be tied to SEO if it’s not already.


Admirable but ultimately inconsequential.


Curious, will robots.txt really be honored? Maybe a legal issue if not?


robots.txt being an honor system has always been a bit weird and search engine treatment is odder still.

Google will still index your disallowed pages but refuse to show a description, since they couldn't read the page. The premise being that Google could have discovered your URLs from other pages, and they don't consider a robots.txt disallow to indicate that you don't want them in search, just that you don't want Google to directly read the page.

To actually disallow indexing of a page, you need to allow it in robots.txt and then add a noindex to the page itself.
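
For reference, that noindex signal can go in the page markup or be sent as a response header (Google documents both forms):

    <meta name="robots" content="noindex">

or

    X-Robots-Tag: noindex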

No one has really explained how these AI agents will treat robots.txt. Does it just mean that Bard/Gemini won't directly read the page but might somehow incorporate it into a dataset?

For a test I asked Gemini "What is the URL of the String api documentation for Java 6?" which is disallowed by robots.txt.

The response was: Unfortunately, Oracle does not maintain separate API documentation for older versions of Java like Java 6. The official documentation website (https://docs.oracle.com/javase/7/docs/api/) only provides documentation for the latest versions.

Oracle has archived some of the older Java API documentation, but 6 is still available at https://docs.oracle.com/javase/6/docs/api/index.html. Funnily enough, the Java 7 API docs which Gemini linked to are also disallowed, but maybe that happened after the model was trained.


OpenAI documents how their crawler honors robots.txt.

https://platform.openai.com/docs/gptbot


After they ingested everything


We don't actually know what they trained on. If we want to trust them, then they've always honored robots.txt, because why wouldn't they? On the other hand, if we want to believe they're as capricious as we make them out to be, they've got a copy of LibGen and every single torrent out there to train on, on top of using robots.txt to find what you don't want crawled. robots.txt is great: it tells you where your evil scraper should go to get the good stuff.


GPT-4 was released to the public on March 14th last year. They only told people how to block their agent in August.


Yes, but:

    User-agent: *
    Disallow: /


I don't see how a robots.txt itself would be legally binding, but I guess you could make an argument for the terms of service? Would love to know more about this as well.


It should be — it’s an accepted standard and would make their behavior even more dubious should they ignore it.


It will not be honored. The ones that honor it will lose to the ones that don't.


So we should make it illegal not to honor it then, if the incentives are to engage in anti-social behavior.


Given that Google has honored robots.txt for many years, not sure this is true. It mostly means that blocked sites might not show up on the biggest, most trustworthy platforms. If these AI platforms become a big way people consume content from the internet, authors will have an interest in being there.

Today, people make money from web traffic, and therefore want their sites indexed by search indices. If the same happens with AI eventually, authors will probably follow. What people get from having their sites ingested by AI is still unknown today — if the model doesn’t send traffic to your site, you can’t make ad money and can’t support your site (if it’s big). Data licensing that allows AI platforms to pay authors for useful content may be a solution someday.


There are many entities out there crawling and scraping the web in non-compliant ways.


It seems quite appropriate that luddites will opt out of the AI training space. Hopefully it will lead to better quality output.


The AI models would not exist without the data the so-called luddites created.


The best content is not created by luddites. Creative curiosity is generally a necessity for great art.


I find that people who are willing to not take shortcuts are often the ones making better work, but that's just my opinion.



