
I've been developing the stack for The ContentMine (http://contentmine.org). In the next month or so, we will start scraping the entire scientific literature as it is published each day, and processing it through our 'fact extraction' pipeline.

Done so far:

- I've defined a JSON format for declarative web scrapers (ScraperJSON: https://github.com/ContentMine/scraperJSON); there's an example scraper after this list

- made a Node library for web scraping with ScraperJSON scrapers (thresher: https://github.com/ContentMine/thresher)

- as well as a command-line client (quickscrape: https://github.com/ContentMine/quickscrape), with an invocation shown after the list

- and a small collection of ScraperJSON scrapers for scientific publishers that is about to start expanding rapidly (https://github.com/ContentMine/journal-scrapers).
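To give a flavour of the format, here's a simplified example of a ScraperJSON scraper (trimmed down; the spec repo and journal-scrapers have the full set of fields). A scraper is just a URL regex saying which pages it applies to, plus a set of named elements, each with an XPath selector and optionally which attribute to capture:

    {
      "url": "peerj\\.com",
      "elements": {
        "title": {
          "selector": "//meta[@name='citation_title']",
          "attribute": "content"
        },
        "fulltext_pdf": {
          "selector": "//meta[@name='citation_pdf_url']",
          "attribute": "content"
        }
      }
    }

You'd run it from the command line with something like the following (check quickscrape --help for the current flags):

    quickscrape \
      --url https://peerj.com/articles/384 \
      --scraper journal-scrapers/scrapers/peerj.json \
      --output peerj-384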

The next step is to build the web interface that will let people compose data-mining pipelines. Imagine something like (a sketch of how such a pipeline might be declared follows these examples):

- "give me a feed of all the articles in journals with 'Cancer' in the title that use HeLa cells in their methods"

- "alert me when a new paper comes out that mentions species X alongside a geographical reference"

- "find all the papers that mention my software in the methods but don't cite me"

I've long thought about doing this with arXiv to determine if you could extract results and synthesize answers to questions. Is anything like that part of the goals for The ContentMine?


Yes. We're trying to build the toolset so that the end product isn't predetermined; rather, we empower people to make whatever they can imagine. In about six months' time it should be relatively easy to implement your idea on top of what we build. arXiv will of course be included :)
