Just what *does* robots.txt mean for a LOD site?

Hi.

I'm pretty sure this discussion suggests that we (the LD community) should try to come to some consensus on policy about exactly what it means if an agent finds a robots.txt on a Linked Data site.

So I have changed the subject line - sorry Chris, it should have been changed earlier.

Not an easy thing to come to, I suspect, but it seems to have become significant.
Is there a more official forum for this sort of thing?

On 26 Jul 2014, at 00:55, Luca Matteis <lmatteis@gmail.com> wrote:

> On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser <hugh@glasers.org> wrote:
>> That sort of sums up what I want.
> 
> Indeed. So I agree that robots.txt should probably not establish
> whether something is a linked dataset or not. To me your data is still
> linked data even though robots.txt is blocking access of specific
> types of agents, such as crawlers.
> 
> Aidan,
> 
>> *) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.
> 
> Isn't that a bit harsh? That would be the case if the only type of
> agent is a crawler. But as Hugh mentioned, linked datasets can be
> useful simply by treating URIs as dereferenceable identifiers without
> following links.
In Aidan's view (I hope I am right here), it is perfectly sensible.
If you start from the premise that robots.txt is intended to prohibit access by anything other than a browser with a human at it, then only humans could fetch the RDF documents.
Which means that the RDF document is completely useless as machine-interpretable semantics for the resource, since it would need a human to do some cut and paste or something to get it into a processor.

It isn't really a question of harsh - it is perfectly logical from that view of robots.txt (which isn't our view, because we think that robots.txt is about "specific types of agents", as you say).
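
To make the "specific types of agents" reading concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the agent names ExampleCrawler and ExampleLDClient, and the example.org URI are all made up for illustration; they are not taken from any real site or from anything agreed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a Linked Data site: one named crawler is
# excluded, every other agent is allowed (an empty Disallow permits all).
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(agent: str, url: str) -> bool:
    """Return True if the rules in ROBOTS_TXT let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical Linked Data URI that a client might want to dereference.
URI = "http://example.org/id/thing"

crawler_allowed = can_fetch("ExampleCrawler", URI)   # the blacklisted crawler
client_allowed = can_fetch("ExampleLDClient", URI)   # a client resolving one URI

print("crawler allowed:", crawler_allowed)
print("client allowed:", client_allowed)
```

Under this reading the file blocks the named crawler but says nothing against an agent that just resolves the odd URI. Under the other reading (anything that is not a human at a browser is a robot) the second agent would have to behave as if it were blocked as well, which is exactly the policy question.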

Cheers
Hugh

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652

Received on Saturday, 26 July 2014 11:18:06 UTC