Use Node.js to Extract Data from the Web (storminthecastle.com)
80 points by johnrobinsn on Aug 25, 2013 | 34 comments



Don't forget streams, the more `node.js` way to parse HTML:

    var http = require('http');
    var tr = require('trumpet')();
    var request = require('request');
    request.get('http://www.echojs.com')
      .pipe(tr.createReadStream("article > span"))
      .pipe(process.stdout);


That's it! See https://github.com/substack/node-trumpet and their tests for more.


You probably meant:

    var tr = require('trumpet')();
    tr.createReadStream('article > span')
      .pipe(process.stdout);
    
    var request = require('request');
    request.get('http://www.echojs.com').pipe(tr);
Bonus: running your intended code turned up a simple bug in the selector engine, which I just fixed in [email protected].


And then there's hyperquest because maybe you want to do more than five simultaneous requests:

https://github.com/substack/hyperquest
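
Swapping it into the trumpet example above looks something like this (a sketch; same placeholder URL and selector as before):

    var hyperquest = require('hyperquest');
    var tr = require('trumpet')();

    tr.createReadStream('article > span')
      .pipe(process.stdout);

    // hyperquest issues a plain streaming GET without the global agent's
    // five-connection pool getting in the way
    hyperquest('http://www.echojs.com').pipe(tr);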


True - you can also disable the globalAgent or change the number of pooled connections. Connection pooling was generally a bad idea (tm) in Node and afaik will be removed in the near future.
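
Something like this, if I remember the knobs correctly (a sketch; double-check against the request/Node docs for your version):

    var http = require('http');
    var request = require('request');

    // Raise the cap on pooled connections for the default agent...
    http.globalAgent.maxSockets = 64;

    // ...or skip agent pooling entirely for a single request.
    request.get({ url: 'http://www.echojs.com', agent: false })
      .pipe(process.stdout);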


I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes w/in iframes and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.


PhantomJS + CasperJS is definitely the way to go when scraping data from complex pages. It's also great for circumventing bot detection. :)


I find scrapy (python) to be more robust for large scale scraping. There are cases where you want/need the javascript action and that's when you need a real browser. Otherwise the rendering would just slow things down.


Does this help with scraping websites that provide data via jQuery? I mean, does it render the JavaScript on the page?


Yes. It interprets and executes JS like a real browser would. Which is nice. For Python: http://jeanphix.me/Ghost.py/


Have you played around with node.io? https://github.com/chriso/node.io

It encapsulates all this functionality in an easy-to-use interface.


Last commit 3 months ago. Do you know if this project is still alive?


Haven't used node.io but 3 months isn't that old.

Also, if you check the issues page for the project ( https://github.com/chriso/node.io/issues ), the author seems to be responding to open issues, with the latest comment by the author being a month ago.


Author here.

Still active, although development has slowed down.

If you have any questions or issues just submit an issue @ Github and I'll help asap.


Node.io is pretty much dead.


There are also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
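
Usage is roughly as below (a hedged sketch from memory of the binding; the exact export and tree shape may differ):

    var gumbo = require('gumbo-parser');

    // Parses HTML5 per the spec and returns a plain JS parse tree,
    // rather than a jQuery-style wrapper like cheerio's.
    var tree = gumbo('<article><span>hello</span></article>');
    console.log(tree);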


Cheerio is really really awesome. I've used it to build a considerably sophisticated web scraping backend to wrap my school's homework website and re-expose/augment via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

If anyone is interested in seeing what a sophisticated, parallelized usage of cheerio looks like, feel free to browse through the app I mentioned above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee


Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide actually walks through creating a scraper:

http://docs.casperjs.org/en/latest/quickstart.html
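
The basic shape of a CasperJS scraper is something like this (a trimmed-down sketch; the URL and selector are just examples):

    var casper = require('casper').create();

    casper.start('http://www.echojs.com/', function () {
        this.echo(this.getTitle());
    });

    casper.then(function () {
        // evaluate() runs inside the page, after any page JavaScript has executed
        var titles = this.evaluate(function () {
            return [].map.call(document.querySelectorAll('article h2 a'), function (a) {
                return a.textContent;
            });
        });
        this.echo(titles.join('\n'));
    });

    casper.run();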

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?


I like to use the readability API so I don't need to see the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-...


See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he built it).


Isn't scrapy easier to use than this?


Cheerio is really easy for anyone familiar with jQuery (most node.js devs I would imagine).
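
For anyone who hasn't tried it, the basic pattern is something like this (a minimal sketch; the URL and selector are just examples):

    var request = require('request');
    var cheerio = require('cheerio');

    request.get('http://www.echojs.com', function (err, res, body) {
        if (err) throw err;
        // load() parses the HTML string; $ then behaves like jQuery, minus a live DOM
        var $ = cheerio.load(body);
        $('article h2 a').each(function () {
            console.log($(this).text());
        });
    });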


It's probably more organized and easier to read than a huge number of nested callbacks.


There are a lot of better ways to do this. Most of them involve documented standards, so your code doesn't break the moment someone changes something.


This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)


Come on, that's not really a scraping problem, it's more of a text parsing problem coupled with an API lookup or scrape to verify the address.

Though I'd probably just Google for some good address regexes, match them against pages, throw each address into something like maps.google.com/?q=[address], and then try to scrape whatever normally pops up for a valid result. It also helps if you're expecting addresses to be in a certain country.



I run an API at http://pagemunch.com that could help with this type of thing where the page includes microformats (a surprising number of pages do).


We also used cheerio and node.js and built a click-and-extract interface around it: http://www.site2mobile.com/.


Interesting, I encountered the same set of problems last year when working on two side projects. I ended up building a web-scraping service with a point-and-click interface on top of it: https://krake.io


Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.


Here is how I like to do it:

  from pyquery import PyQuery as pq
  doc = pq('http://google.com')
  print doc('#hplogo')


nice!

I did a webcrawler with node.js myself last year. It's only a quick try but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately jsdom had a memory leak so the crawler died after a while...


If you want to fix the memory leak, I remember you need to do `window.close()` after the job is done.
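
With the old jsdom.env style that looks roughly like this (a sketch from memory; details vary across jsdom versions):

    var jsdom = require('jsdom');

    jsdom.env('http://www.echojs.com', ['http://code.jquery.com/jquery.js'], function (errors, window) {
        if (errors) { console.error(errors); return; }
        window.$('article h2 a').each(function () {
            console.log(window.$(this).text());
        });
        // Free the document and its resources so long-running crawlers don't leak
        window.close();
    });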


thanks mate!



