Go ahead and block AI web crawlers (coryd.dev)
26 points by cdme on March 2, 2024 | 22 comments



One advantage Google has here over everyone else is they can tie it to search crawling and say “sure, you can block Gemini but only by blocking GoogleBot”. That’s a tradeoff very few would make, whereas there’s not much to lose by blocking OpenAI. This gives Google a massive advantage over competitors in terms of training data.


How does one differentiate between "AI web crawlers" and "non-AI web crawlers"?

What faith can be placed in User-Agent strings? The contents of this header have been faked since the birth of the www in the early 90s.
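
For illustration, a minimal Python sketch of how trivial spoofing is; the URL is a placeholder and the header value is simply whatever the sender chooses to type:

    # Any HTTP client can claim to be any crawler; the server only
    # sees the string the sender put in the header.
    import urllib.request

    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()  # served as though the request came from Googlebot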

How does anyone know what someone will do with the data they have crawled? There are no transparency requirements, there are no legally-enforceable agreements. There are only IP addresses and User-Agent headers.

No one can look at a User-Agent header or an IP address and conclude, "I know what someone will do with the data I let them access, because this header or address confirms it". Those accessing the data could use it for any purpose or transfer it to someone else to use for any purpose. Unless the website operator has a binding agreement with the organisation doing the crawling, any assumptions about future behaviour made on the basis of a header or IP address offer no control whatsoever.

Perhaps an IP address could be used to conclude the HTTP requests are being sent by Company X, but the address does not indicate what Company X will do with the data it collects. Company X can do whatever it wants. It does not need to tell anyone what it is doing; it could make up a story about what it is doing that conveniently conceals facts.

These so-called "tech" companies that are training "AI" are secretive and non-transparent. Further, they will lie when it suits their interests. They do not ask for permission, they only ask for forgiveness. They are strategic and unfortunately deceptive in what they tell the public. "Trust" at your own risk.

Although it may be useless as a means of controlling how crawling data is used, it still makes sense to me to put something in robots.txt to indicate there is no consent given for crawling for the purpose of training "AI". Better would be to publish some explicit notice to the public that no consent is given to use data from the website to train "AI".
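
As a sketch, such a robots.txt entry might target the training-crawler tokens the major vendors have published (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl); whether anyone honors it is, of course, the whole question:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /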

Put the restrictions in a license. Let the so-called "tech" companies assent to that license. Then, when the evidence of unauthorised data use becomes available, enforce the license.


Block any crawler that doesn't come from known ranges that you whitelist.
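
A minimal sketch of that check in Python; the CIDR below is illustrative (vendors publish and rotate their ranges, and a real setup would also verify reverse DNS):

    # Allowlist check: reject crawler traffic from outside known ranges.
    # The range below is an example of a published Googlebot block; load
    # the current lists from each vendor rather than hardcoding them.
    from ipaddress import ip_address, ip_network

    ALLOWED_RANGES = [ip_network("66.249.64.0/19")]

    def is_allowed(client_ip: str) -> bool:
        addr = ip_address(client_ip)
        return any(addr in net for net in ALLOWED_RANGES)

    print(is_allowed("66.249.66.1"))  # True: inside the allowlisted range
    print(is_allowed("203.0.113.9"))  # False: unknown source, block it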


A good example of why this won’t work, one that’s 10x more extreme and has legal consequences… in theory…

Didn’t the FBI admit that it just bought U.S. citizen data instead of spying on them? Almost two years in a row?

This has been happening non-stop for over 20 years without any repercussions.

I don’t think robots.txt will make a difference. If your site is internet-facing, nothing will prevent it from being crawled and scraped.

If anything, I can almost guarantee AI crawling will be tied to SEO if it’s not already.


Admirable but ultimately inconsequential.


Curious, will robots.txt really be honored? Maybe a legal issue if not?


robots.txt being an honor system has always been a bit weird and search engine treatment is odder still.

Google will still index your disallowed pages but refuse to show a description, since they couldn't read the page. The premise being that Google could have discovered your URLs from other pages, and they don't consider a robots.txt disallow to indicate that you don't want them in search, just that you don't want Google to directly read the page.

To actually disallow indexing of a page, you need to allow it in robots.txt and then add a noindex to the page itself.
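
For reference, that noindex signal can go in the page markup or be sent as a response header (Google documents both forms):

    <meta name="robots" content="noindex">

or

    X-Robots-Tag: noindex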

No one has really explained how these AI agents will treat robots.txt. Does it just mean that Bard/Gemini won't directly read the page but might somehow incorporate it into a dataset?

For a test I asked Gemini "What is the URL of the String api documentation for Java 6?" which is disallowed by robots.txt.

The response was: Unfortunately, Oracle does not maintain separate API documentation for older versions of Java like Java 6. The official documentation website (https://docs.oracle.com/javase/7/docs/api/) only provides documentation for the latest versions.

Oracle has archived some of the older Java API documentation, but 6 is still available at https://docs.oracle.com/javase/6/docs/api/index.html. Funnily enough, the Java 7 API docs which Gemini linked to are also disallowed, but maybe that happened after the model was trained.


OpenAI documents how their crawler honors robots.txt.

https://platform.openai.com/docs/gptbot


After they ingested everything


We don't actually know what they trained on. If we want to trust them, then they've always honored robots.txt, because why wouldn't they? On the other hand, if we want to believe they're as capricious as we make them out to be, they've got a copy of LibGen and every single torrent out there to train on, on top of using robots.txt to find what you don't want crawled. robots.txt is great: it tells you where your evil scraper should go to get the good stuff.


GPT-4 was released to the public on March 14th last year. They only told people how to block their agent in August.


Yes, but:

    User-agent: *
    Disallow: /


I don't see how a robots.txt itself would be legally binding, but I guess you could make an argument for the terms of service? Would love to know more about this as well.


It should be — it’s an accepted standard and would make their behavior even more dubious should they ignore it.


It will not be honored. The ones that honor it will lose to the ones that don't.


So we should make it illegal not to honor it then, if the incentives are to engage in anti-social behavior.


Given that Google has honored robots.txt for many years, not sure this is true. It mostly means that blocked sites might not show up on the biggest, most trustworthy platforms. If these AI platforms become a big way people consume content from the internet, authors will have an interest in being there.

Today, people make money from web traffic, and therefore want their sites indexed by search indices. If the same happens with AI eventually, authors will probably follow. What people get from having their sites ingested by AI is still unknown today — if the model doesn’t send traffic to your site, you can’t make ad money and can’t support your site (if it’s big). Data licensing that allows AI platforms to pay authors for useful content may be a solution someday.


There are many entities out there crawling and scraping the web in non-compliant ways.


It seems quite appropriate that luddites will opt out of the AI training space. Hopefully it will lead to better quality output.


The AI models would not exist without the data the so-called luddites created.


The best content is not created by luddites. Creative curiosity is generally a necessity for great art.


I find that people who are willing to not take shortcuts are often the ones making better work, but that's just my opinion.



