Ferret – Declarative web scraping (github.com/montferret)
260 points by ziflex on Oct 2, 2018 | 92 comments



I think we are confusing two things here.

The OP probably meant this tool as something useful in his situation and shared it with others, in case someone else needs something like that. He was probably thinking along the lines of "If someone needs that, it's fine, let them discover something like that exists and use it". People over here are viewing this as a one-size-fits-all solution for web scraping and pointing out valid reasons why it's a bad idea to use it like that. I think that we should accept that this tool might be good for some, but completely unnecessary for others and we shouldn't criticize it for not being useful for our purposes.

One situation where this tool has clear advantages over other solutions is client-side scraping. If we made an app for iOS/Android/Windows/whatever that runs on devices owned by end users and crawls data upon request, perhaps from multiple websites, having the crawlers written as external scripts would be extremely useful. That would allow you to push updates to the crawlers separately, immediately after a website changes its layout, without the need to update your app. Making a gallery of downloadable crawlers for more sources would probably also be possible. The limitations of the language are very advantageous in this situation, as crawlers are mostly sandboxed and can't destroy your filesystem, steal your data, etc. This tool also allows keeping the crawlers separate from the app, which would let people create a global npm-like repository of crawlers that works regardless of what programming language you use (provided someone wrote an implementation of this tool in your language). Imagine a use case like building a book price comparison app in, let's say, Java for Android and Swift for iOS, and maybe even C++ because some libraries still run Windows XP and you'd like the app to be available there, and being able to download crawlers for Amazon and tens of local bookselling websites that work in all of those apps, without having to write them yourself. If used right, this tool could actually allow programmers to act as if there were a semantic web as originally imagined, and to write services that interact with various websites in surprising ways without thinking about how the interactions are done at a lower level.


Thank you very much for your valuable feedback and I'm glad that someone has finally got the idea :)


I'm sorry but there is nothing new here? This seems like a backwards step if anything.

Usually when web scraping, I can just load in the HtmlAgilityPack (C#), point it at a URL, then write some functional code to extract the necessary data.

Even better, I'll examine the website in Fiddler and hope they have a data-view separation going on, and be able to just intercept the json file they load instead.

Worst-case scenario, I need to dynamically click on buttons etc., but this can usually be handled by Selenium, or if they detect that, just roll a custom implementation with CefSharp (again not hard, just download the NuGet package, and it lets you run your own custom JavaScript).

A new, more limited, language (with no IDE tools) is not the way to go. If anything, a better web scraper would just make the processes I mentioned above more seamless, for example combining finding/selecting of elements in Chrome with codegen.


The main advantage of this over your approach with HtmlAgilityPack is that Ferret can handle dynamic web pages - those that are rendered with JS. And also, it can emulate user interactions. But anyway, thanks for your feedback :)


I think AngleSharp can handle JS and it's not that different from HtmlAgilityPack. https://github.com/AngleSharp/AngleSharp


14th of July this year: https://github.com/AngleSharp/AngleSharp/issues/693. I'm not sure there's much JS support.


The code for doing this isn't too difficult with https://github.com/chromedp/chromedp, is this just some helpers around that? I haven't used it or puppeteer on the node side that heavily, but what have you found difficult that deserves this kind of wrapper/abstraction instead of direct library use?
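For reference, direct use looks roughly like this (a minimal sketch assuming the current chromedp API; the URL and selectors are made up):

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/chromedp/chromedp"
    )

    func main() {
        // Browser context backed by a headless Chrome instance.
        ctx, cancel := chromedp.NewContext(context.Background())
        defer cancel()

        var title string
        // Navigate, fill an input, click, wait, then read back some text.
        err := chromedp.Run(ctx,
            chromedp.Navigate("https://example.com"),
            chromedp.SendKeys(`input[name="q"]`, "ferret"),
            chromedp.Click(`input[type="submit"]`),
            chromedp.WaitVisible(`h1`),
            chromedp.Text(`h1`, &title),
        )
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(title)
    }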


Right but in that case that implies a view-model separation, so you can usually just access the data file directly, which is usually json.


I'm sorry, I do not fully understand what you mean. Imagine that you need to grab some data from SoundCloud and, also, imagine they do not have a public API :) How would you do it without launching a browser?


You just need to look at the packets in Fiddler as the page loads, find the request that gets the data you want, and then clone that request in your application.

I just took a look at SoundCloud and was able to get the data within a minute, as it's a basic setup. They use a JSON structure (most websites do) [1], and the data is fetched with a simple GET request.

If the website requires authentication, you just need to clone those requests too, and then you'll get some sort of cookie/session id back which you attach to any future requests.

As a bonus: You can then throw the json into a code converter online too which will convert your structure to a Type as well (useful if you are using a static language).

[1] https://imgur.com/a/Einoxdg
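In code, the "clone" is just a plain HTTP request plus JSON decoding. A rough Go sketch (the endpoint, headers, and struct fields are made up for illustration):

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    // Track mirrors only the fields we care about from the JSON payload
    // (the real response shape will differ).
    type Track struct {
        Title string `json:"title"`
        URL   string `json:"permalink_url"`
    }

    func main() {
        // Hypothetical endpoint spotted in Fiddler / the devtools Network tab.
        req, err := http.NewRequest("GET", "https://example.com/api/tracks?limit=10", nil)
        if err != nil {
            log.Fatal(err)
        }
        // Clone whatever headers/cookies the original request carried (auth, UA, etc.).
        req.Header.Set("User-Agent", "Mozilla/5.0")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        var tracks []Track
        if err := json.NewDecoder(resp.Body).Decode(&tracks); err != nil {
            log.Fatal(err)
        }
        for _, t := range tracks {
            fmt.Println(t.Title, t.URL)
        }
    }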


This is so 2010... websites now use a mix of server-side and client-side rendering. You can't just watch HTTP requests and figure out the "API" specifications. Just try to scrape AdWords/Gmail/Amazon etc. Consider that they may also use anti-scraping code on the client.


Even without a documented API, there is often one at play behind websites. Example: a website is available at "http://example.com/books/1234". When loading it, you see that it fires a request to "http://example.com/api/booksdata/1234" to load the data that populates the page. So now you don't have to use a slow browser that loads everything, but can just use your normal HTTP client (for all the ids you know).


That's true. You can do it, of course. I'm not saying that this is the only way of doing it.


I think what's being noted is that when the data comes as a data structure on the page, or as data passed back from an XHR request, you can just use that data directly and there's less page scraping to be done. This is generally how dynamic pages are created, out of a shipped data structure and rules to create the page out of it. If you have the data structure, it's generally much easier to parse than the page generated from it.

That said, for pages that use a background request to fetch the data, this can be useful, as that data used to build the page isn't always kept around as a data structure (at least not one easily accessible) afterwards. That is if accessing the endpoint of the background request isn't feasible for some reason.


He means that you look at the private API and use that.


> I'll examine the website in Fiddler

Isn't that overkill compared to just using the built-in browser devtools Network tab?


I find the Fiddler UI easy to use, plus you can add plugins such as converting the request straight to code. Up to you what tool you use of course.


Both Firefox and Chrome have a Copy As Curl feature which is really useful. I agree with you though, the UI sucks.


I think I have never seen so many negative comments before. Yes, the OP's idea might be a bit unnecessary, but it surely looks interesting. It could mature into a very interesting project. And depending on your business, if you rely heavily on crawling, having a specific language for it might help with making your code more uniform.


What makes this language "declarative"? It looks pretty imperative to me.


@OP/project creator: If your language allows the writer to define a program's control flow, more often than not it's imperative, and not declarative.


I'm not actually seeing much code controlling the flow of the program.


I mean, I can remove control flow constructs from an imperative language and call it declarative, but that wouldn't be very useful.

What is the advantage of it being declarative? At least for the example, the equivalent imperative code is about the same length.


I don't think there is much advantage in being declarative in this case, because whenever I need to scrape stuff I have to do a lot of edge-case handling and I want to be in control. But it does seem to be a recurring dream to make some sort of web query language (based on my memories of old Dr. Dobb's issues).


The flow in the example is simple: click -> wait -> process result; but it's still there. Notice that instructions are executed in the order they are defined, that is enough to make it imperative.


Yes, but I think it could be made truly declarative with very little work. Take out the waits, and maybe don't carry the doc around all over the place; the context of the page being processed should be figured out by the interpreter.

On edit: although, web interfaces being what they are, some things need to be order dependent, like INPUT(google, 'input[name="q"]', "ferret") followed by CLICK(google, 'input[name="btnK"]'). I mean, you need to click the button after you fill out the input.


Well, I would say it's more declarative than imperative. There are a few things that make it less declarative: variables and the ternary operator.


Came to say this. This isn't declarative!


Yeah, I opened the page kind of hoping for something like GraphQL. Scraping libs are cool, but I have absolutely no idea why this is at the top of Hacker News. It's not declarative, it's unnecessarily a language instead of just a library, and the only things it has that HN would love are that it's written in Go and that it's kind of an esoteric niche. Is that really all it takes?


To be honest, I am not sure why this needs yet another query language.. :(


Actually, it's not yet another language :) The syntax is taken from ArangoDB - AQL https://docs.arangodb.com/3.3/Manual/


I was expecting an ML-driven framework where you write the HTML you want to scrape, and the framework diffs the trees and attempts to extract the information from the target tree as best it can to match your input tree. That's what pops into mind when I think of "declarative" scraping.

  LET google = DOCUMENT("https://www.google.com/", true)

  INPUT(google, 'input[name="q"]', "ferret")
  CLICK(google, 'input[name="btnK"]')
  WAIT_NAVIGATION(google)
  LET result = (
    FOR result IN ELEMENTS(google, '.g')
      RETURN {
        title: ELEMENT(result, 'h3 > a'),
        description: ELEMENT(result, '.st'),
        url: ELEMENT(result, 'cite')
      }
  )
  RETURN (
    FOR page IN result
    FILTER page.title != NONE
    RETURN page
  )
Looks an awful lot like:

  const { document, input, elements, waitNavigation } = require("your-library")
  const scrape = () => {
    let google = document("...", true)
    input(google, "...", "...")
    click(google, "...")
    waitNavigation(google)
    return elements(google, ".g")
      .map(r => {...})
      .filter(p => {..})
  }
  scrape();
Am I missing something here? I don't see anything declarative about the first one over the second; both of these look identical and rather imperative to me. Is "declarative" becoming a buzzword (thanks to React, maybe?), or am I missing something?


This looks all wrong. Page scraping is not accessing a data source in a way where a query language makes any sense. The moment you need to interact with the page and admit there's a DOM under there, it breaks the idiom.

And why it's remaking variable declaration I don't know, and why is the for loop so verbose? If you insist on a query language, go the whole way and remove repetition and syntax complexity, because that's the only thing that could actually add value.


The DOM is a representation of some data, which means you can extract that data and then manipulate it. The language itself has nothing related to the DOM; all DOM operations are implemented via functions from the standard library.

"Good artist copy, great artist steal" I'm trying to be a good artist trying to not invent a new brand language (I'm not that smart), so I just picked up (copied) an existing one that fits better for dealing with complex structures like trees. So it is AQL - ArangoDB Query Language. https://docs.arangodb.com/3.3/Manual/

If you have any suggestions on how to improve the language, you are very welcome.


How about using an existing language, like Python? You can make a really great DSL using Python, and then people have access to all the other Python language features they already know, the stdlib they already know, and the third-party modules they already know.


I could, if I knew Python well enough :) But I've done it the way I needed it to be done. I wanted an isolated and safe environment that would allow me to easily scrape the web without dealing with infrastructural code.


Yup, I get it, I want that too, but I don't want to learn another language just to do that :/


I think this could go further in terms of making it declarative.

A simple declarative approach could be taking this:

    LET google = DOCUMENT("https://www.google.com/", true)
and instead of thinking about it as an action (get this page), think about it as giving you an object. The result is a tuple of the URL, the time fetched, and maybe other information (like User-Agent). This helps with exploratory scraping, where you want to be able to repeat actions without always re-fetching the documents. And you'll be constructing a program, unlike a REPL where you always write the program top-to-bottom, including all your intermediate bugs.

Changing DOCUMENT() is easy enough. Things like CLICK() are a bit harder, though if you extend the data structures you can have a document that is the result of clicking a certain element in a certain previous document. Again to do it the first time you have to actually DO the action, but later on perhaps not. And you'll be constructing interstitial objects that are great for debugging.

Then what could make it feel really declarative is having more than one presentation of an execution. You can package up a scraping, and then you can answer questions about WHY you ended up with certain results.


That's what you can do right now :)

https://github.com/MontFerret/ferret/blob/master/docs/exampl...

The document returned from the DOCUMENT() function represents an open browser tab, which allows you to do all interactions with the page.


Well, that's what I'm saying... right now, making it represent an open browser tab with a specific state and where everything DOES something isn't declarative. But it could be declarative if you changed how those commands are implemented.

Or, to phrase it another way: if the program represents a PLAN then it's declarative. If it represents a series of things to DO then it's imperative. It seems like it's doing things, but it could plan things with the same syntax.


Oh yes. The reason for this is that, for now, the language itself is DOM agnostic; it's just a port of an existing one (https://docs.arangodb.com/3.4/AQL/). So the entire DOM thing is implemented by the standard library, which is pluggable. In the future, I might extend the language to make it less DOM agnostic by introducing new keywords for dealing with that. But for now you have to move the document object around. Which is not that bad, because you may open as many pages as you want in a single query.


I don't see why web scraping should be declarative at all. XPath is declarative, and hard to use. When a human browses a website, they do one thing at a time. That is inherently imperative. A DSL for highly-imperative "human-style" web scraping is a nice idea in my opinion. That's exactly what Ferret appears to be.


And you can open as many pages as you want in a single query (or as your memory allows you :) )


In late 2000 at the tail end of the bubble bursting, there was a search engine company trying to build on a platform of push notifications instead of crawling.

I know that at my current company the plurality of traffic comes from crawlers. We don't want to throttle them because that's biting the hand that feeds, but it sure sucks.

I wonder often, how many crawlers you need before it’s cheaper for a website to volunteer up when pages change or new ones arrive.



There are N existing standards for websites to notify crawlers about new content. Few websites use them, and the ones that do are often buggy. A few are even malicious.

I'm looking forward to the N+1th standard.


Parsing the HTML or traversing the DOM is the easy part. Doing request queues, IP rotation, data quality management, exponential backoff, etc. at scale is much harder.
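Even something as small as exponential backoff ends up being its own piece of infrastructure. A rough Go sketch of the idea (retry counts and delays are arbitrary):

    package main

    import (
        "fmt"
        "math/rand"
        "net/http"
        "time"
    )

    // fetchWithBackoff retries a GET with exponentially growing, jittered delays.
    func fetchWithBackoff(url string, maxRetries int) (*http.Response, error) {
        var lastErr error
        for attempt := 0; attempt < maxRetries; attempt++ {
            resp, err := http.Get(url)
            if err == nil && resp.StatusCode != 429 && resp.StatusCode < 500 {
                return resp, nil
            }
            if resp != nil {
                resp.Body.Close()
                lastErr = fmt.Errorf("status %d", resp.StatusCode)
            } else {
                lastErr = err
            }
            // 1s, 2s, 4s, ... plus jitter so a fleet of workers doesn't retry in lockstep.
            delay := time.Duration(1<<attempt)*time.Second +
                time.Duration(rand.Intn(500))*time.Millisecond
            time.Sleep(delay)
        }
        return nil, fmt.Errorf("giving up on %s: %w", url, lastErr)
    }

And that's before you add per-domain politeness, proxy rotation, and dedup queues on top.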


PRs are welcome :) There is going to be a separate project within the organization that will do all these things and even more. It's just the beginning :)


Does anyone else find the narrative style in the README (memes, Internet language, etc) obnoxious?


If you're looking for a slightly-more-native way to declaratively scrape (the data-binding aspect, at least) in Go based on CSS selectors, I wrote a wrapper around GoQuery that just uses struct tags for mapping. You don't have to learn a new language, and it should feel familiar if you've ever written CSS or jQuery. I've found it helps reduce a lot of boilerplate and makes things a lot easier to come back and read than what was previously a lot of nested GoQuery functions, selectors, etc.

Might be helpful and in a similar vein. :-)

https://github.com/andrewstuart/goq
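A quick sketch of what that looks like (field names, selectors, and URL are illustrative; check the repo for the exact import path and tag syntax):

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "astuart.co/goq" // import path as documented in the repo, if memory serves
    )

    // Page maps CSS selectors to fields via struct tags.
    type Page struct {
        Title string   `goquery:"h1"`
        Links []string `goquery:"a,[href]"`
    }

    func main() {
        resp, err := http.Get("https://example.com")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        var p Page
        if err := goq.NewDecoder(resp.Body).Decode(&p); err != nil {
            log.Fatal(err)
        }
        fmt.Println(p.Title, len(p.Links))
    }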


Web scraping "at scale" ends up being a lot more complicated than blinding firing HTTP requests. Scrapy, for example, supports arbitrary "middleware" that can, for example, follow 301 Redirect, respect robots.txt files, follow sitemap.xml files, etc.

To what extent is this supported (or, to what extent do you plan to support it)? Similarly, since the front-end language is essentially a compiler, would it be possible to write an alternative "backend" (e.g. something that distributes requests across a cluster)?


This package is more like a runtime. There are plans to create a dedicated server, where you will be able to store your queries, schedule them, and set up output streams like Spark or Flink. For now, it does not respect robots.txt, but that can easily be added.

Out of the box, there is no scaling mechanism yet, since the project is a WIP. But it's written in Go, which makes it pretty fast.

One idea for how you could scale it is to run a cluster of headless Chrome instances, put a proxy/load balancer in front of it, and give Ferret a URL to the cluster. It will treat it as a single instance of Chrome. The only problem is that you would need to differentiate requests from the CDP (Chrome DevTools Protocol) client and, once a page is open, redirect all related requests to the same Chrome instance.


Some sites are so broken that the only way to parse some of their HTML is to use regexps or substring searches on the source code. Maybe this tool can handle that through extensions.


This is imperative scraping, and not as easy as just working with XML (which is what the page is anyway).


One of the goals of the project is to hide the technical details and complexities that come with modern web scraping, especially when you deal with dynamic web pages.


Which is already very easy with puppeteer

I feel that in this case it's solely to stay within the Go ecosystem, which seems counter to what matters: business value.


Yes, puppeteer and Ferret use the same technology under the hood: the Chrome DevTools Protocol.

But, the purpose of the project is not to "be better than". No, the purpose is to let developers focus on the data itself, without dealing with technical details of underlying technologies. You can treat it as a higher abstraction on top of puppeteer / CDP.

It does not really matter whether it's written in JS or Go. The main goal is to simplify data scraping and make it more declarative (even though some people say it's not a declarative language :))


I think the name could be a problem? Or is this related to ferret the c library with ruby bindings? https://github.com/dbalmain/ferret


Nope, just thought that ferrets are cool :)


This reminds me of Russ Cox's toy Webscript language: https://research.swtch.com/webscript


yetanotherscrapingframework


It's not so interesting to me that people continue to feel the need to build scraping frameworks (it's an excellent beginner project because it encompasses so much of web development); what's interesting is why Hacker News finds scraping frameworks so interesting. There seems to be a scraping framework at the top of HN every week or two.


Personal theory, as someone who has done a good bit of scraping over the last decade:

Extremely common problem space, with a lot of tantalizing opportunities for a "platform" or "shared language" where one doesn't exist. I see existing tooling like BeautifulSoup, Scrapy, Selenium, as isomorphisms to the "near hardware level" of our problem-space-tool-stack, whereas the abstractions and higher level logic is often defined per the use case.

On top of (or perhaps because of) that, one often writes a lot of boilerplate, but when it comes time to genericize, one often ends up writing the tool that genericizes within their own problem space, and for all the tantalizing opportunities, the number of "not quite fully intersecting scraping problem spaces" (and associated tradeoffs/different paradigms) is far more massive than I considered when I started any of my own scraping tools.

This has led me to take a very opinionated view with my own tooling, wherein I build for _ONE SPECIFIC RECURRING SCRAPING PRIMITIVE_ (in my case, treating the whole world as a stream of events). You want something that can be more or less massaged into that? Cool; it might be something for you. If not? You probably want to look at a different set of tools.


I think the comments here are unfairly harsh. I really like this innovative idea of having a dedicated language. If it can see client-side rendered HTML (e.g. React, Vue, etc.), that would be a whole other level for me, since I don't think this has been done before.


It can! :)

Even more: it can interact with these pages! Here is an example using the Google Search page: https://github.com/MontFerret/ferret/blob/master/docs/exampl...


Wow, thank you. Now I am absolutely in love with your tool!


Great! ^_^


I feel the same here, architecturally I've never liked scrapers which are tightly coupled within a service. Websites are prone to sudden change and it seems unsustainable to redeploy each time a selector changes, it's pretty innovative to leverage the power of *QL for this sort of problem.


Yes! This is one of the reasons I wanted to be able to make these changes without redeploying the whole thing!


I am currently using the https://www.npmjs.com/package/proxycrawl web scraping system, which offers anonymous scraping too, but they do not have a Go library yet. Does this system also support anonymous scraping, e.g. over a proxy? I'd love to use it if I can scrape URLs the way I do with the ProxyCrawl API, like in this example: https://api.proxycrawl.com/?token=DsFuFiigAZ2Wm6U1BPh7Zw&for...


Proxy support is not in place yet. I will add it in future releases. You are welcome to do a PR :)


Token?


I do not use the TCP token, so I could share it for free; I use the API with the JavaScript token, which gives me more of the dynamic content.


For many years now, the best way to scrape has been to use a headless browser, for example PhantomJS with Node.js. That, in combination with Tor or a large proxy pool, is unbeatable by all other alternatives.


This is how it works under the hood. But everything is wired for you ;)


I wouldn't scrape over Tor, you would be slowing down the network for people who actually need it. Maybe if you are running a node.


Why is this a language instead of a library on top of an existing language?

Here's what it would look like as a JavaScript (Node.js or browser) library:

    let g = getDocument("https://www.google.com/", true);
    
    g.input('input[name="q"]', "ferret");
    g.click('input[name="btnK"]');
    
    g.waitNavigation();
    
    let result = g.elements('.g').map(r => ({
      title: r.element('h3 > a'),
      description: r.element('.st'),
      url: r.element('cite')
    }));
    
    return result.filter(i => i.title !== null);


John Hammond: I don't think you're giving us our due credit. Our scientists have done things which nobody's ever done before...

Ian Malcolm: Yeah, yeah, but your scientists were so preoccupied with whether or not they could, that they didn't stop to think if they should.


I'm not sure why you're being downvoted. As far as I can tell from the examples, there is nothing that this language brings to the table that couldn't be implemented instead as an API on top of an existing language.


The "funny" part is, while reading this, I kept thinking of a similar tool I built myself as a wrapper around python+beautifulsoup. Definitely parts of what the OP needed were compelling to me (I found regular subpatterns in my scraping work that could be encapsulated really well in certain bits of syntactic sugar, but concluded that a minimal json-like structure for defining a scrape was both sufficient and let me have a graphical "scrape builder" in a UX far more readily than if I actually wrote the scrape as code.

There's the usual amount of HN cynicism in the thread, which I'm not sure is 200% off the mark, but I think there are some good concepts in "scraping primitives" that can be contemplated, and the OP took an interesting angle on them (or rather: not the angle I took, so interesting to me).


Is your beautifulsoup wrapper open source?


If other people would find it beneficial, it can be; I had admittedly seen it as "just a mess of sugar to help me scrape easier" (with all the typical nervousness about showing others "imperfect" code). I'm currently in the process of cleaning it up, writing tests, and finishing an MVP UI. I can do a Show HN in a few weeks once I've found the time to get it ship-shape.


You definitely need to share. Web scraping is tedious. The more ideas we have, the more options we have to come up with a better solution for it.


That's true. The difference is how much effort is needed to do that using an API.

What it brings is just a higher abstraction over that API, which lets you get work done easily.


Do you have a more involved example where Ferret really shines, as opposed to a library with a similar API in JS or another common language? I really don't mean to be negative, but I just don't see how Ferret is any easier to use than something like Nightmare[0]. That said, I'm wondering if it's an issue of communication more than anything, so maybe a different example than the one in the readme would help.

[0]: https://github.com/segmentio/nightmare


You are fine, I totally understand your scepticism. And you are right, there are definitely issues in communication.

First of all, I built it for myself. I needed a high-level representation of scraping logic that would run in an isolated and safe environment. Second, I needed to be able to easily scrape dynamic pages.

So, what I got is:

- A high-level, declarative-ish language that hides all the infrastructural details and helps you focus on the logic itself: you describe what you want without worrying about the underlying technology. Today I'm using headless Chrome; tomorrow I may use something else, but the change should not affect your code.
- Full support for dynamic pages. You can get data from dynamically rendered pages, emulate user actions, etc. Heck, you can even write bots with it.
- Embeddable. For now I have only a CLI; there are plans to write a web server where you can save your scripts, schedule them, and set up output streams.

But the main idea is to provide high level declarative way of scraping the web. I'm not saying you can't do that with other tools. I'm just trying to come up with something more easy to work with.

Regarding examples: the project is still a WIP, so as I get more complex features, I'll get more complex examples. Here is a more or less complex one, getting data from Google Search. It's not that difficult, but it showcases the core feature of working with dynamic pages.

https://github.com/MontFerret/ferret/blob/master/docs/exampl...


"Much more effort"? Right now it implements a library and a language on top of that. Making it just be a library would cut the work in half.


The idea is to create a high-level abstraction that represents your web scraping logic. The project is still a WIP. I will create a web server which will help you store your queries, schedule them, and set up output streams to other systems like Spark and Flink.


Javascript isn't in uppercase, therefore it is an inferior language to write queries in. The best, most concise solution here is to reinvent javascript in uppercase, then pass it off as a new QL.


No one is responding to your comment BECAUSE IT ISN'T IN ALL CAPS TO MAKE THEM NOTICE.


The main purpose is to use scripts like SQL, where you can write and modify your data scripts without compilation. Plus, the project aims to simplify the process and hide the technical complexity behind it. Moreover, don't forget that the system can work with dynamic pages, which brings more complexity underneath.

And finally, you can use it as a library. It's totally embeddable.


It is also a library for the Go language.
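For the curious, embedding looks roughly like this (a minimal sketch based on the project README; the exact API may have changed since):

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/MontFerret/ferret/pkg/compiler"
    )

    func main() {
        comp := compiler.New()

        // Compile an FQL query once...
        program, err := comp.Compile(`FOR i IN 1..3 RETURN i * 2`)
        if err != nil {
            log.Fatal(err)
        }

        // ...and run it whenever you need fresh data; Run returns JSON-encoded bytes.
        out, err := program.Run(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(out)) // e.g. [2,4,6]
    }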



