One advantage Google has here over everyone else is they can tie it to search crawling and say “sure, you can block Gemini but only by blocking GoogleBot”. That’s a tradeoff very few would make, whereas there’s not much to lose by blocking OpenAI. This gives Google a massive advantage over competitors in terms of training data.
How does one differentiate between "AI web crawlers" and "non-AI web crawlers"?
What faith can be placed in User-Agent strings? The contents of this header have been faked since the birth of the www in the early 90s.
How does anyone know what someone will do with the data they have crawled? There are no transparency requirements and no legally enforceable agreements. There are only IP addresses and User-Agent headers.
No one can look at a User-Agent header or an IP address and conclude, "I know what someone will do with the data I let them access, because this header or address confirms it." Those accessing the data could use it for any purpose or transfer it to someone else to use for any purpose. Unless the website operator has a binding agreement with the organisation doing the crawling, any assumptions about future behaviour made on the basis of a header or IP address offer no control whatsoever.
Perhaps an IP address could be used to conclude the HTTP requests are being sent by Company X, but the address does not indicate what Company X will do with the data it collects. Company X can do whatever it wants. It does not need to tell anyone what it is doing; it could make up a story about what it is doing that conveniently conceals facts.
These so-called "tech" companies that are training "AI" are secretive and non-transparent. Further, they will lie when it suits their interests. They do not ask for permission, they only ask for forgiveness. They are strategic and unfortunately deceptive in what they tell the public. "Trust" at own risk.
Although it may be useless as a means of controlling how crawled data is used, it still makes sense to me to put something in robots.txt indicating that no consent is given for crawling for the purpose of training "AI". Better still would be to publish an explicit notice to the public that no consent is given to use data from the website to train "AI".
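For what it's worth, the crawlers that claim to honor opt-outs have published user-agent tokens for exactly this. A minimal robots.txt sketch, assuming the currently documented tokens (GPTBot for OpenAI, Google-Extended for Gemini training, CCBot for Common Crawl):

    # Opt out of AI-training crawlers while leaving ordinary search crawling alone.
    # These directives are advisory only; nothing forces a crawler to obey them.
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

This at least expresses the lack of consent in a machine-readable form, but it is still only an honor system.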
Put the restrictions in a license. Let the so-called "tech" companies assent to that license. Then, when evidence of unauthorised data use becomes available, enforce the license.
robots.txt being an honor system has always been a bit weird, and search engines' treatment of it is odder still.
Google will still index your disallowed pages but refuse to show a description, since it couldn't read the page. The premise is that Google could have discovered your URLs from other pages, and a robots.txt disallow isn't taken to mean that you don't want them in search, just that you don't want Google to directly read the page.
To actually prevent a page from being indexed, you need to allow it in robots.txt and then add a noindex directive to the page itself (see the sketch below).
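As a sketch, assuming you control the page markup, that second step looks like this:

    <!-- In the page's <head>. The page must NOT be blocked in robots.txt,
         otherwise the crawler never gets to see this directive. -->
    <meta name="robots" content="noindex">

For non-HTML resources the same directive can be sent as an X-Robots-Tag: noindex response header instead.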
No one has really explained how these AI agents will treat robots.txt. Does it just mean that Bard/Gemini won't directly read the page but might somehow incorporate it into a dataset?
For a test I asked Gemini "What is the URL of the String API documentation for Java 6?", a page which is disallowed by robots.txt.
The response was: Unfortunately, Oracle does not maintain separate API documentation for older versions of Java like Java 6. The official documentation website (https://docs.oracle.com/javase/7/docs/api/) only provides documentation for the latest versions.
Oracle has archived some of the older Java API documentation, but the Java 6 docs are still available at https://docs.oracle.com/javase/6/docs/api/index.html. Funnily enough, the Java 7 API docs that Gemini linked to are also disallowed, but maybe that happened after the model was trained.
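If you want to check what a site's robots.txt actually says about a URL, here is a rough Python sketch using only the standard library (it fetches Oracle's live robots.txt, so the result depends on whatever the file says when you run it):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://docs.oracle.com/robots.txt")
    rp.read()  # download and parse the site's robots.txt

    url = "https://docs.oracle.com/javase/6/docs/api/index.html"
    for agent in ("Googlebot", "GPTBot", "*"):
        print(agent, "can fetch:", rp.can_fetch(agent, url))

Of course, this only tells you what the site asks for, not what any given crawler actually does with it.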
We don't actually know what they trained on. If we want to trust them, they've always honored robots.txt, because why wouldn't they? On the other hand, if we want to believe they're as capricious as we make them out to be, they've got a copy of LibGen and every single torrent out there to train on, on top of using robots.txt to find what you don't want crawled. robots.txt is great - it tells you where your evil scraper should go to get the good stuff.
I don't see how a robots.txt itself would be legally binding, but I guess you could make an argument for the terms of service? Would love to know more about this as well.
Given that Google has honored robots.txt for many years, not sure this is true. It mostly means that blocked sites might not show up on the biggest, most trustworthy platforms. If these AI platforms become a big way people consume content from the internet, authors will have an interest in being there.
Today, people make money from web traffic, and therefore want their sites indexed by search engines. If the same happens with AI eventually, authors will probably follow. What people get from having their sites ingested by AI is still unknown today: if the model doesn't send traffic to your site, you can't make ad money and can't support your site (if it's big). Data licensing that allows AI platforms to pay authors for useful content may be a solution someday.