- From: Hugh Glaser <hg@ecs.soton.ac.uk>
- Date: Thu, 28 Jan 2010 12:26:29 +0000
- To: Stephane Corlosquet <scorlosquet@gmail.com>
- CC: "public-lod@w3.org" <public-lod@w3.org>
Thanks for the pointer.
(Won't actually look at the ARC code at the moment, as it may be hard to comply with Benji's license.)

However, rather than being as clever as possible, somehow I thought I should respect what the publisher said, so perhaps first Content-Type, then extension, rather than ignoring them.

The reason I wasn't relying on rapper --guess is that the handover to rapper is part of the RDF store, and I will probably use other stores that don't use rapper.
Also, I wanted to gather statistics on what RDF format people were using, and couldn't see an option to rapper to tell me the input type that it guessed.

At the moment I record the Content-Type and the extension, and then let rapper or whatever do their magic - I guess that is enough.

Cheers
Hugh

On 28/01/2010 02:25, "Stephane Corlosquet" <scorlosquet@gmail.com> wrote:

Hugh,

The ARC2 parser has a "built-in RDF format detector" [1]. You might want to look at the code to see how it's done.

Why not using the --guess option of rapper?

Steph.

[1] http://arc.semsol.org/docs/v2/parsing

On Wed, Jan 27, 2010 at 9:08 PM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:

On 27/01/2010 09:49, "Tom Heath" <tom.heath@talis.com> wrote:
> +1 for Moriarty, whether you're working with the Platform or not. Ian
> and the other contributors have done a great job - personally I'd
> start here before writing any new code.

Too true mate.

Now my next bit of pissing about.
Before writing it (if I can find the gumption).
Don't think this is in Moriarty, as the Talis Platform is, of course, well-behaved.

I run cURL, using an amended version of what was described before (as at the end of this message).
So now I need to deal with what comes back.
I actually hand it over to rapper, so would sort of like to know what the data is to improve the reliability by setting the rapper type parameter.
I am trying to avoid looking inside the file, although am happy to if someone can provide the code :-).
The Content-Type is unreliable - for example could (is likely to) be text/plain for a turtle file that someone has put on a standard web server.
So it is the usual problem of messing about with extensions, modified by extra information from the Content-Type.
Of course we need to worry about the final URL (curl_getinfo($ch)['url']), possibly as well as the requesting URI, as that might be where there is an extension.

So perhaps something that sets the Content-Type in curl_getinfo($ch) as best it can?
Any offers? (Pretty please!)
And maybe we can feed back to Moriarty, PEAR, etc, unless already there and I missed it.

On another worry, if the requesting URI does a 302 to a new URI, which then does 303, it looks an interesting challenge to capture the new URI as expected.
I don't intend to do this at the moment, but if anyone has done that, ...

Enjoy.
Hugh

PHP much preferred.

Fetching code:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $_REQUEST['uri']);
    curl_setopt($ch, CURLOPT_USERAGENT, "http://void.rkbexplorer.com/ submission agent 1.0");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept: application/rdf+xml, text/n3, text/rdf+n3, text/turtle, application/x-turtle, application/turtle, text/plain"));
    $data = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);

>
> My 2p worth :)
>
> Tom.
>
> 2010/1/26 Ian Davis <lists@iandavis.com>:
>> You may find something useful in my Moriarty project:
>>
>> http://code.google.com/p/moriarty/
>>
>> It's geared towards the Talis Platform but there is a lot of code in
>> there that has no dependencies on the platform, e.g.:
>>
>> http://code.google.com/p/moriarty/source/browse/trunk/httprequest.class.php
>>
>> some documentation for that class here:
>>
>> http://code.google.com/p/moriarty/wiki/HttpRequest
>>
>> Ian
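
For illustration, the "first Content-Type, then extension" idea from the reply at the top of this message could be sketched roughly as below. This is only a sketch under stated assumptions, not code from Moriarty, ARC2 or rapper: the function name guess_rdf_format() and both lookup tables are hypothetical, and treating the N3 media types and the .n3 extension as Turtle is an assumption that will not hold for all N3 documents. The returned strings are meant to be usable as rapper input names (rdfxml, turtle, ntriples). text/plain is deliberately left out of the media type table, so it falls through to the extension of the final URL.

```php
<?php
// Hypothetical helper: guess a parser name from what the publisher said.
// Order: Content-Type first, then the extension of the final (post-redirect) URL.
function guess_rdf_format(?string $contentType, string $finalUrl): ?string
{
    // Assumed mapping of media types to parser names; extend as needed.
    $byMime = [
        'application/rdf+xml'  => 'rdfxml',
        'text/turtle'          => 'turtle',
        'application/x-turtle' => 'turtle',
        'application/turtle'   => 'turtle',
        'text/n3'              => 'turtle', // assumption: parse N3 as Turtle
        'text/rdf+n3'          => 'turtle', // assumption: parse N3 as Turtle
    ];

    // Assumed mapping of file extensions to parser names.
    $byExt = [
        'rdf' => 'rdfxml',
        'owl' => 'rdfxml',
        'ttl' => 'turtle',
        'n3'  => 'turtle', // assumption: parse N3 as Turtle
        'nt'  => 'ntriples',
    ];

    // Drop parameters such as "; charset=utf-8" and normalise case.
    $mime = strtolower(trim(explode(';', (string) $contentType)[0]));
    if (isset($byMime[$mime])) {
        return $byMime[$mime];
    }

    // Uninformative type (e.g. text/plain): look at the path of the final URL.
    $path = (string) parse_url($finalUrl, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // null means "no idea", leaving the decision to the parser's own guessing.
    return $byExt[$ext] ?? null;
}
```

With the fetching code quoted above, this would be called as guess_rdf_format($info['content_type'], $info['url']), since curl_getinfo() reports both the Content-Type and the effective URL after redirects. The raw Content-Type and extension can still be recorded separately for statistics.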
Received on Thursday, 28 January 2010 12:27:28 UTC