> Google itself got big by indexing other people's data without compensation Wro...

karamanolev · 2025-02-07T12:35:57 1738931757

> Web site owners chose to make it available to Google.

Strong disagree. Since robots.txt is optional and the default is "crawl me as you please", website owners don't "choose to make it available", they just don't choose to make it non-available.
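To make the "default is crawl me" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The site URL and the crawler name `ExampleBot` are made up for illustration; the only claim is how the parser treats an empty rule set versus an explicit `Disallow`.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; an empty file yields no rules at all.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical site and crawler name, for illustration only.
URL = "https://example.com/private/report.html"

# No rules at all: fetching is permitted by default.
print(is_allowed("", "ExampleBot", URL))

# The owner has to add a rule to opt out.
print(is_allowed("User-agent: *\nDisallow: /private/", "ExampleBot", URL))
```

With no rules the parser permits the fetch; it only refuses once the owner has written a `Disallow` line.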

XorNot · 2025-02-07T12:53:15 1738932795

That's a functionally meaningless distinction. If you set up a web server that responds to requests, then you're choosing to make content available, because your server can choose not to respond to requests. The entire protocol includes mechanisms to negotiate access.
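A rough sketch of what "the server can choose not to respond" might look like, using Python's standard `http.server`. The blocked crawler names are hypothetical, and this assumes the crawler reports its User-Agent honestly, which nothing in the protocol enforces.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical crawler names the site owner wants to refuse.
BLOCKED_AGENTS = ("ExampleBot", "OtherCrawler")


def status_for(user_agent: str) -> int:
    """Pick an HTTP status: 403 for blocked crawlers, 200 for everyone else."""
    ua = user_agent.lower()
    if any(name.lower() in ua for name in BLOCKED_AGENTS):
        return 403
    return 200


class SelectiveHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The User-Agent header is self-reported by the client.
        status = status_for(self.headers.get("User-Agent", ""))
        body = b"hello\n" if status == 200 else b"forbidden\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve on localhost until interrupted (not called automatically)."""
    HTTPServer(("127.0.0.1", port), SelectiveHandler).serve_forever()
```

Calling `serve()` and requesting the page with a blocked User-Agent should return 403, while any other client gets the content.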

jokethrowaway · 2025-02-07T13:16:14 1738934174

Granting access and granting the right to redistribute (even just title + snippet) and use your content commercially are two completely different things.

XorNot · 2025-02-07T20:29:16 1738960156

And yet it is legal to produce and redistribute summaries as sufficiently transformative derivative works, and this has been court tested[1]. Of course in Australia we passed rather specific laws to the contrary, because lo and behold Rupert Murdoch wanted money and gosh darn it our government was going to give it to him[2].

[1] https://www.practicalecommerce.com/Search-Engines-Indexing-a...

[2] https://www.alrc.gov.au/publication/copyright-and-the-digita...

eviks · 2025-02-07T13:16:56 1738934216

This is a meaningless simplification. In this framework "robots.txt" has no role, because your server "can choose" not to respond. Heck, even DDoS is fine, because "protocol".

RALaBarge · 2025-02-07T12:35:28 1738931728

To your first point, the OP said without compensation, not without permission.

tobyhinloopen · 2025-02-07T12:52:13 1738932733

a) If you don't have a robots.txt, you're indexed by default. It's opt-out, not opt-in. If you do nothing, you're being indexed.

antiframe · 2025-02-07T17:50:19 1738950619

It's an opt-out of an opt-in. If you run a webserver hosting your files, you already opted in to people accessing that data. If you then don't go ahead and configure it properly, that's not exactly "opt-out" anymore. By default your files are not accessible to the network; you have to first opt in to serving them.

tobyhinloopen · 2025-02-08T08:44:15 1739004255

Google makes a copy of your data and serves that data to users before they visit your site.

Also, Google cache allows users to get a copy of your site without visiting your site.

Why can they republish your data while we cannot? Why do we have to opt-out?

veggieroll · 2025-02-07T13:26:36 1738934796

Robots.txt is irrelevant after hiQ Labs v. LinkedIn (2019)

fredgrott · 2025-02-07T12:35:52 1738931752

Point c is wrong: they have had ads since the original Yahoo contract.

threeseed · 2025-02-07T12:38:31 1738931911

Yahoo contract was 2 years after it launched.

I remember using Google the day it went public and it had no ads, which made it unique compared to AltaVista.

boesboes · 2025-02-07T12:36:06 1738931766

Wrong. Google ignores robots.txt entirely.

threeseed · 2025-02-07T12:41:17 1738932077

I wasn't aware. Can you please update Wikipedia then: https://en.wikipedia.org/wiki/Robots.txt

Maybe also get Google to update their docs: https://developers.google.com/search/docs/crawling-indexing/...

phit_ · 2025-02-07T13:08:27 1738933707

Their own docs also specify that robots.txt does not stop a page from being indexed or showing up in search. They even bolded it: "it is not a mechanism for keeping a web page out of Google"

https://developers.google.com/search/docs/crawling-indexing/...

threeseed · 2025-02-07T20:55:10 1738961710

From the docs:

“While Google won't crawl or index the content blocked by a robots.txt file”

They will show the URL if someone else has linked to it. But the content itself is not indexed.

alphan0n · 2025-02-07T13:48:03 1738936083

The only way for links to appear in a Google search would be to host a public resource, that is linked from another public resource.

If you have specified in your robots.txt that you do not want the page(s) or directories ingested, then only the URL is indexed (if it is linked from another page). It does prevent the public display of the page's content and the creation of a description/summary.

https://support.google.com/webmasters/answer/7489871?hl=en
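For the case where you want the URL itself kept out of results, a small sketch of the separate mechanism Google documents: a `noindex` rule, sent either as an `X-Robots-Tag` response header or as a robots meta tag. The helper name below is hypothetical. Note that the page must not be disallowed in robots.txt, otherwise the crawler never fetches it and never sees the directive.

```python
def robots_headers(keep_out_of_index: bool) -> dict:
    """Build response headers; X-Robots-Tag: noindex asks search engines not to index the page."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if keep_out_of_index:
        # Response-header form of the directive.
        headers["X-Robots-Tag"] = "noindex"
    return headers


# In-page form of the same directive, placed inside <head>.
NOINDEX_META = '<meta name="robots" content="noindex">'

print(robots_headers(True))
```

So robots.txt controls what gets crawled, while `noindex` controls what gets listed, and the two work against each other if combined on the same page.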

nottorp · 2025-02-07T12:50:35 1738932635

It must be nice to believe everything people say by default... ;)