> Google itself got big by indexing other people's data without compensation
Wrong.
a) Robots.txt, which defines what content you wish to make available to third parties, dates from 1994 and predates Google and most of today's search engines. Web site owners chose to make it available to Google, and search engines have respected their wishes despite it not being in their best interest.
b) The difference here is that OpenAI, Meta, etc. have not even tried to honour the wishes of copyright holders. They just considered everything as theirs.
c) Google grew big because it had no ads, a fast interface, and PageRank, which was significantly better than the alternatives. It wasn't because it had the most comprehensive index.
> Web site owners chose to make it available to Google.
Strong disagree. Since robots.txt is optional and the default is "crawl me as you please", website owners don't "choose to make it available", they just don't choose to make it non-available.
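To make the "default is crawl me" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the URL, and the helper `is_allowed` are made up for illustration, and an empty file stands in for a site that never wrote any rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical inputs, chosen only for illustration.
NO_RULES = ""                               # owner never wrote any rules
OPT_OUT = "User-agent: *\nDisallow: /\n"    # owner explicitly opts out
URL = "https://example.com/article.html"

print(is_allowed(NO_RULES, "ExampleBot", URL))  # no rules: crawling is allowed
print(is_allowed(OPT_OUT, "ExampleBot", URL))   # explicit Disallow: not allowed
```

With no rules at all, the parser answers "allowed"; the owner has to act to get anything else.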
That's a functionally meaningless distinction. If you set up a web server that responds to requests, then you're choosing to make content available, because your server can choose not to respond to requests. The entire protocol includes mechanisms to negotiate access.
And yet it is legal to produce and redistribute summaries as sufficiently transformative derivative works, and this has been court tested[1]. Of course in Australia we passed rather specific laws to the contrary, because lo and behold Rupert Murdoch wanted money and gosh darn it our government was going to give it to him[2].
This is a meaningless simplification. In this framework "robots.txt" has no role, because your server "can choose" not to respond. Heck, even DDoS is fine, because "protocol".
It's an opt-out of an opt-in. If you run a webserver hosting your files, you already opted in to people accessing that data. If you then don't go ahead and configure it properly, that's not exactly "opt-out" anymore. By default your files are not accessible to the network; you have to first opt in to serving them.
Their own docs also specify that robots.txt does not stop indexing or showing up in search. They even bolded it: "it is not a mechanism for keeping a web page out of Google".
The only way for links to appear in a Google search would be to host a public resource, that is linked from another public resource.
If you have specified in your robots.txt that you do not want the page(s) or directories ingested, then only the URL is indexed (if it is linked from another page). It does prevent the public display of the page's content and the creation of a description/summary.
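A rough sketch of that behaviour, as a toy model and not Google's actual implementation: the function `index_entry`, the `fetch` callback, and the 80-character snippet length are all assumptions made up for illustration. A disallowed URL still gets an entry, but without any content-derived snippet.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def index_entry(robots_txt: str, user_agent: str, url: str,
                fetch: Callable[[str], str]) -> dict:
    """Build a toy index record for a URL discovered via a link elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    snippet: Optional[str] = None
    if parser.can_fetch(user_agent, url):
        # Only fetch and summarise the page when robots.txt permits it.
        snippet = fetch(url)[:80]
    # The URL itself is recorded either way, because another page linked to it.
    return {"url": url, "snippet": snippet}


# Hypothetical rules and a fake fetcher, so the sketch runs without a network.
RULES = "User-agent: *\nDisallow: /private/\n"
fake_fetch = lambda url: "Page body for " + url

print(index_entry(RULES, "ExampleBot", "https://example.com/private/a.html", fake_fetch))
print(index_entry(RULES, "ExampleBot", "https://example.com/open.html", fake_fetch))
```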