True - you can also disable the globalAgent or change the number of pooled connections. Connection pooling was generally a bad idea (tm) in Node and afaik will be removed in the near future.
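For reference, a minimal sketch of both knobs, using nothing but Node's core http module (the host/path here are just placeholders):

    var http = require('http');

    // Raise the cap on pooled sockets for the shared global agent
    // (older Node defaults this to a small number).
    http.globalAgent.maxSockets = 20;

    // Or opt out of pooling for a single request by passing agent: false.
    http.get({ host: 'example.com', path: '/', agent: false }, function (res) {
      console.log('status:', res.statusCode);
    });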
I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes within iframes and forms buried 6 POSTs deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.
I find Scrapy (Python) to be more robust for large-scale scraping. There are cases where you want/need the JavaScript to actually run, and that's when you need a real browser. Otherwise the rendering would just slow things down.
Also, if you check the issues page for the project ( https://github.com/chriso/node.io/issues ), the author seems to be responding to open issues; the author's latest comment there was about a month ago.
It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
Cheerio is really, really awesome. I've used it to build a fairly sophisticated web scraping backend that wraps my school's homework website and re-exposes/augments it via node/mongo/backbone/websockets.
There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.
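For anyone who hasn't tried it, here's a minimal sketch of the workflow; the HTML below is just a stand-in for whatever page body you'd actually fetch:

    var cheerio = require('cheerio');

    // Stand-in for a response body fetched with http/request.
    var html = '<ul id="links">' +
               '<li><a href="/a">First</a></li>' +
               '<li><a href="/b">Second</a></li>' +
               '</ul>';

    var $ = cheerio.load(html);

    // Familiar jQuery-style traversal, server-side.
    $('#links a').each(function () {
      console.log($(this).attr('href'), '->', $(this).text());
    });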
If anyone is interested in seeing what a sophisticated, parallelized usage of cheerio looks like, feel free to browse through the app I mentioned above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee
See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.
It was initially built as a hack project to replace a core subset of YQL. (I helped guide Aaron Acerboni, an intern at my company, Dharmafly, when he built it.)
Come on, that's not really a scraping problem; it's more of a text parsing problem coupled with an API lookup or scrape to verify the address.
Though I'd probably just google for some good address regexes, match them against pages, throw each address into something like maps.google.com/?q=[address], then try to scrape whatever normally pops up for a valid result. It also helps if you're expecting addresses to be in a certain country.
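Something like this rough sketch; the regex is just an illustrative US-style street pattern, not anything robust:

    // Crude, illustrative pattern -- real address matching needs much more care.
    var addressPattern = /\d{1,5}\s+\w+(\s\w+)*\s+(Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b/gi;

    var pageText = '... visit us at 123 Main Street, Springfield ...';
    var matches = pageText.match(addressPattern) || [];

    matches.forEach(function (address) {
      // Hand each candidate to a maps lookup and scrape/verify whatever comes back.
      console.log('https://maps.google.com/?q=' + encodeURIComponent(address));
    });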
Interesting, I ran into the same set of problems last year when working on two side projects. I ended up building a web scraping service with a point-and-click interface on top: https://krake.io