Use Node.js to Extract Data from the Web (storminthecastle.com)
80 points by johnrobinsn on Aug 25, 2013 | 34 comments



Don't forget streams, the more `node.js` way to parse HTML:

    var http = require('http');
    var tr = require('trumpet')();
    var request = require('request');
    request.get('http://www.echojs.com')
      .pipe(tr.createReadStream("article > span"))
      .pipe(process.stdout);


That's it! See https://github.com/substack/node-trumpet and their tests for more.


You probably meant:

    var tr = require('trumpet')();
    tr.createReadStream('article > span')
      .pipe(process.stdout);
    
    var request = require('request');
    request.get('http://www.echojs.com').pipe(tr);
Bonus: running your intended code turned up a simple bug in the selector engine, which I just fixed in [email protected].


And then there's hyperquest because maybe you want to do more than five simultaneous requests:

https://github.com/substack/hyperquest
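
Swapping it into the trumpet example above looks something like this (a sketch; same placeholder URL and selector as before):

    var hyperquest = require('hyperquest');
    var tr = require('trumpet')();

    tr.createReadStream('article > span')
      .pipe(process.stdout);

    // hyperquest issues a plain streaming GET without the global agent's
    // five-connection pool getting in the way
    hyperquest('http://www.echojs.com').pipe(tr);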


True - you can also disable the globalAgent or change the number of pooled connections. Connection pooling was generally a bad idea (tm) in Node and afaik will be removed in the near future.
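
Something like this, if I remember the knobs correctly (a sketch; double-check against the request/Node docs for your version):

    var http = require('http');
    var request = require('request');

    // Raise the cap on pooled connections for the default agent...
    http.globalAgent.maxSockets = 64;

    // ...or skip agent pooling entirely for a single request.
    request.get({ url: 'http://www.echojs.com', agent: false })
      .pipe(process.stdout);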


I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes w/in iframes and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.


PhantomJS + CasperJS is definitely the way to go when scraping data from complex pages. It's also great for circumventing bot detection. :)


I find scrapy (python) to be more robust for large scale scraping. There are cases where you want/need the javascript action and that's when you need a real browser. Otherwise the rendering would just slow things down.


Does this help with scraping websites that provide data via jQuery? I mean, does it render the JavaScript on the page?


Yes. It interprets and executes JS like a real browser would. Which is nice. For Python: http://jeanphix.me/Ghost.py/


Have you played around with node.io? https://github.com/chriso/node.io

It encapsulates all this functionality in an easy-to-use interface.


Last commit 3 months ago. Do you know if this project is still alive?


Haven't used node.io but 3 months isn't that old.

Also, if you check the issues page for the project ( https://github.com/chriso/node.io/issues ), the author seems to be responding to open issues, with the latest comment by the author being a month ago.


Author here.

Still active, although development has slowed down.

If you have any questions or issues just submit an issue @ Github and I'll help asap.


Node.io is pretty much dead.


There are also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
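
Usage is roughly as below (a hedged sketch from memory of the binding; the exact export and tree shape may differ):

    var gumbo = require('gumbo-parser');

    // Parses HTML5 per the spec and returns a plain JS parse tree,
    // rather than a jQuery-style wrapper like cheerio's.
    var tree = gumbo('<article><span>hello</span></article>');
    console.log(tree);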


Cheerio is really really awesome. I've used it to build a considerably sophisticated web scraping backend to wrap my school's homework website and re-expose/augment via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

If anyone is interested in seeing what a sophisticated, parallelized usage of cheerio looks like, feel free to browse through the app I mentioned above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee


Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide actually walks through creating a scraper:

http://docs.casperjs.org/en/latest/quickstart.html
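
The basic shape of a CasperJS scraper is something like this (a trimmed-down sketch; the URL and selector are just examples):

    var casper = require('casper').create();

    casper.start('http://www.echojs.com/', function () {
        this.echo(this.getTitle());
    });

    casper.then(function () {
        // evaluate() runs inside the page, after any page JavaScript has executed
        var titles = this.evaluate(function () {
            return [].map.call(document.querySelectorAll('article h2 a'), function (a) {
                return a.textContent;
            });
        });
        this.echo(titles.join('\n'));
    });

    casper.run();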

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?


I like to use the readability API so I don't need to see the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-...


See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he built it).


Isn't scrapy easier to use than this?


Cheerio is really easy for anyone familiar with jQuery (most node.js devs I would imagine).
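
For anyone who hasn't tried it, the basic pattern is something like this (a minimal sketch; the URL and selector are just examples):

    var request = require('request');
    var cheerio = require('cheerio');

    request.get('http://www.echojs.com', function (err, res, body) {
        if (err) throw err;
        // load() parses the HTML string; $ then behaves like jQuery, minus a live DOM
        var $ = cheerio.load(body);
        $('article h2 a').each(function () {
            console.log($(this).text());
        });
    });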


It's probably more organized and easier to read than a huge number of nested callbacks.


There are a lot of better ways to do this. Most of them involve documented standards, so your code doesn't break the moment someone changes something.


This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)


Come on, that's not really a scraping problem, it's more of a text parsing problem coupled with an API lookup or scrape to verify the address.

Though I'd probably just Google for some good address regexes, match them against pages, throw each address into something like maps.google.com/?q=[address], and then try to scrape whatever normally pops up for a valid result. It also helps if you're expecting addresses to be in a certain country.



I run an API at http://pagemunch.com that could help with this type of thing where the page includes microformats (a surprising number of pages do).


We also used cheerio and node.js and built a click-and-extract interface around it: http://www.site2mobile.com/.


Interesting, I encountered the same set of problems last year when working on two side projects. I ended up building a web-scraping service with a point-and-click interface on top of it: https://krake.io


Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.


Here is how I like to do it:

  from pyquery import PyQuery as pq
  doc = pq('http://google.com')
  print doc('#hplogo')


nice!

I did a webcrawler with node.js myself last year. It's only a quick try but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately jsdom had a memory leak so the crawler died after a while...


If you want to fix the memory leak, I remember you need to do `window.close()` after the job is done.
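
With the old jsdom.env style that looks roughly like this (a sketch from memory; details vary across jsdom versions):

    var jsdom = require('jsdom');

    jsdom.env('http://www.echojs.com', ['http://code.jquery.com/jquery.js'], function (errors, window) {
        if (errors) { console.error(errors); return; }
        window.$('article h2 a').each(function () {
            console.log(window.$(this).text());
        });
        // Free the document and its resources so long-running crawlers don't leak
        window.close();
    });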


thanks mate!



