The OP probably meant this tool as something useful in his situation and shared it with others, in case someone else needs something like that. He was probably thinking along the lines of "If someone needs that, it's fine, let them discover something like that exists and use it". People over here are viewing this as a one-size-fits-all solution for web scraping and pointing out valid reasons why it's a bad idea to use it like that. I think that we should accept that this tool might be good for some, but completely unnecessary for others and we shouldn't criticize it for not being useful for our purposes.
One situation where this tool has clear advantages over other solutions is client-side scraping. If we made an app for iOS/Android/Windows/whatever that runs on devices owned by end users and crawls data on request, perhaps from multiple websites, having the crawlers written as external scripts would be extremely useful. That would allow you to push updates to the crawlers separately, immediately after a website changes its layout, without the need to update your app. A gallery of downloadable crawlers for more sources would probably also be possible. The limitations of the language are very advantageous in this situation, as crawlers are mostly sandboxed and can't destroy your filesystem, steal your data, etc. This tool also allows keeping the crawlers separate from the app, which would let people create a global npm-like repository of crawlers that works regardless of what programming language you use (provided someone wrote an implementation of this tool in your language). Imagine a use case like building a book price comparison app in, let's say, Java for Android and Swift for iOS, and maybe even C++ because some libraries still run Windows XP and you would like the app to be available there, and being able to download crawlers for Amazon and dozens of local bookselling websites that would work in all of those apps, without needing to write them yourself. Used right, this tool could actually let programmers pretend there's a semantic web as originally imagined and write services that interact with various websites in surprising ways, without thinking about how the interactions are done at a lower level.
I'm sorry, but there is nothing new here. This seems like a step backwards, if anything.
Usually when web scraping, I can just load in the HtmlAgilityPack (C#), point it at a URL, then write some functional code to extract the necessary data.
Even better, I'll examine the website in Fiddler, hope they have data-view separation going on, and just intercept the JSON file they load instead.
Worst-case scenario, I need to dynamically click on buttons etc., but this can usually be handled by Selenium, or if they detect that, just roll a custom implementation with CefSharp (again not hard, just download the NuGet package, and it lets you run your own custom JavaScript).
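For what it's worth, the same "point a parser at a URL and extract" workflow is only a few lines in Go with goquery; the URL and selectors below are placeholders, not a real site:

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        // Fetch the page (placeholder URL).
        resp, err := http.Get("https://example.com/books")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Parse the static HTML and pull out the fields we care about
        // (placeholder selectors).
        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        doc.Find(".book").Each(func(_ int, s *goquery.Selection) {
            fmt.Println(s.Find("h2").Text(), s.Find(".price").Text())
        })
    }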
A new, more limited language (with no IDE tooling) is not the way to go. If anything, a better web scraper would just make the processes I mentioned above more seamless, for example by combining finding/selecting elements in Chrome with code generation.
The main advantage of this over your approach with HtmlAgilityPack is that Ferret can handle dynamic web pages - those that are rendered with JS.
It can also emulate user interactions.
But anyway, thanks for your feedback :)
The code for doing this isn't too difficult with https://github.com/chromedp/chromedp, is this just some helpers around that? I haven't used it or puppeteer on the node side that heavily, but what have you found difficult that deserves this kind of wrapper/abstraction instead of direct library use?
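For reference, direct chromedp use for a simple "navigate, wait, extract" flow looks roughly like this (placeholder URL and selector):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/chromedp/chromedp"
    )

    func main() {
        // Start a headless Chrome instance and bound the whole run.
        ctx, cancel := chromedp.NewContext(context.Background())
        defer cancel()
        ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
        defer cancel()

        // Navigate, wait for a dynamically rendered element, read its text.
        var heading string
        err := chromedp.Run(ctx,
            chromedp.Navigate("https://example.com/"),
            chromedp.WaitVisible("h1", chromedp.ByQuery),
            chromedp.Text("h1", &heading, chromedp.ByQuery),
        )
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(heading)
    }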
I'm sorry, I do not fully understand what you mean.
Imagine that you need to grab some data from SoundCloud, and also imagine they do not have a public API :)
How would you do it without launching a browser?
You just need to look at the packets in Fiddler as the page loads, find the request that gets the data you want, and then clone that request in your application.
I just took a look at SoundCloud and was able to get the data within a minute, as it's a basic setup. They use a JSON structure (most websites do) [1], and the data is fetched with a simple GET request.
If the website requires authentication, you just need to clone those requests too, and then you'll get back some sort of cookie/session ID which you attach to any future requests.
As a bonus: you can then throw the JSON into an online code converter, which will convert the structure into a type as well (useful if you are using a statically typed language).
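In Go terms, the whole thing boils down to replaying that request and decoding the JSON into a struct. The endpoint, headers, and field names below are invented for illustration, not SoundCloud's actual API:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    // Track mirrors only the JSON fields we care about.
    // These names are illustrative, not a real API's schema.
    type Track struct {
        Title string `json:"title"`
        URL   string `json:"permalink_url"`
    }

    func main() {
        // Recreate the request observed in Fiddler/DevTools,
        // including any headers or cookies the site expects.
        req, err := http.NewRequest("GET", "https://example.com/api/tracks?q=ferret", nil)
        if err != nil {
            log.Fatal(err)
        }
        req.Header.Set("User-Agent", "Mozilla/5.0")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        var tracks []Track
        if err := json.NewDecoder(resp.Body).Decode(&tracks); err != nil {
            log.Fatal(err)
        }
        for _, t := range tracks {
            fmt.Println(t.Title, t.URL)
        }
    }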
This is so 2010... websites now use a mix of server-side and client-side rendering. You can't just watch HTTP requests and figure out the "api" specifications. Just try to scrape AdWords/Gmail/Amazon etc. Consider that they may also use anti-scraping code on the client.
Even without a documented API, there is often one at work behind websites.
Example: a website is available at "http://example.com/books/1234". When loading it, you see that it fires a request to "http://example.com/api/booksdata/1234" to load the data that populates the page. So now you don't have to use a slow browser that loads everything, but can just use your normal HTTP client (for all the IDs you know).
I think what's being noted is that when the data comes as a data structure on the page, or as data passed back from an XHR request, you can just use that data directly and there's less page scraping to be done. This is generally how dynamic pages are created, out of a shipped data structure and rules to create the page out of it. If you have the data structure, it's generally much easier to parse than the page generated from it.
That said, for pages that use a background request to fetch the data, this can be useful, as that data used to build the page isn't always kept around as a data structure (at least not one easily accessible) afterwards. That is if accessing the endpoint of the background request isn't feasible for some reason.
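A sketch of that shortcut in Go, using the hypothetical endpoint from the example above and a polite delay between requests:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "time"
    )

    // Book is a guess at the payload shape behind the page;
    // the endpoint and fields are hypothetical.
    type Book struct {
        ID    int    `json:"id"`
        Title string `json:"title"`
    }

    func main() {
        client := &http.Client{Timeout: 10 * time.Second}

        // Hit the background endpoint directly for each known ID,
        // skipping the full page render entirely.
        for id := 1230; id <= 1240; id++ {
            url := fmt.Sprintf("http://example.com/api/booksdata/%d", id)
            resp, err := client.Get(url)
            if err != nil {
                log.Printf("id %d: %v", id, err)
                continue
            }

            var b Book
            err = json.NewDecoder(resp.Body).Decode(&b)
            resp.Body.Close()
            if err != nil {
                log.Printf("id %d: %v", id, err)
                continue
            }
            fmt.Println(b.ID, b.Title)

            time.Sleep(500 * time.Millisecond) // be polite
        }
    }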
I think I have never seen so many negative comments before. Yes, the OP's idea might be a bit unnecessary, but it surely looks interesting. It could mature into a very interesting project. And depending on your business, if you rely heavily on crawling, having a specific language for it might help with making code more uniform.
I don't think there is much advantage in being declarative in this case, because whenever I need to scrape stuff I have to do a lot of edge-case handling and I want to be in control. But it does seem to be a recurring dream to make some sort of web query language (based on my memories of old Dr. Dobb's issues).
The flow in the example is simple: click -> wait -> process result; but it's still there. Notice that instructions are executed in the order they are defined, that is enough to make it imperative.
Yes, but I think it could be made truly declarative with very little work.
Take out the waits, and maybe don't carry the doc around all over the place; the context of the page being processed should be figured out by the interpreter.
On edit: although, web interfaces being what they are, some things need to be order dependent, like:
INPUT(google, 'input[name="q"]', "ferret")
CLICK(google, 'input[name="btnK"]')
I mean, you need to click the button after you fill out the input.
Yeah, I opened the page kind of hoping for something like GraphQL. Scraping libs are cool, but I have absolutely no idea why this is at the top of Hacker News. It's not declarative, it's unnecessarily a language instead of just a library, and the only things it has that HN would love are that it's written in Go and that it's kind of an esoteric niche. Is that really all it takes?
I was expecting an ML-driven framework where you write the HTML you want to scrape, and the framework diffs the trees and attempts to extract the information from the target tree as best it can to match your input tree. That's what pops into mind when I think of "declarative" scraping.
LET google = DOCUMENT("https://www.google.com/", true)
INPUT(google, 'input[name="q"]', "ferret")
CLICK(google, 'input[name="btnK"]')
WAIT_NAVIGATION(google)
LET result = (
    FOR result IN ELEMENTS(google, '.g')
        RETURN {
            title: ELEMENT(result, 'h3 > a'),
            description: ELEMENT(result, '.st'),
            url: ELEMENT(result, 'cite')
        }
)
RETURN (
    FOR page IN result
        FILTER page.title != NONE
        RETURN page
)
Am I missing something here? I don't see anything declarative about the first one over the second; both of these look identical and rather imperative to me. Is "declarative" becoming a buzzword (thanks to React, maybe?), or am I missing something?
This looks all wrong. Page scraping is not accessing a data source in a way where a query language makes any sense. The moment you need to interact with the page and admit there's a DOM under there, it breaks the idiom.
And why it reinvents variable declaration I don't know, and why is the for loop so verbose? If you insist on a query language, go the whole way and remove repetition and syntactic complexity, because that's the only thing that could actually add value.
The DOM is a representation of some data, which means you can extract that data and then manipulate it.
The language itself has nothing related to the DOM. All DOM operations are implemented via functions from the standard library.
"Good artist copy, great artist steal"
I'm trying to be a good artist trying to not invent a new brand language (I'm not that smart), so I just picked up (copied) an existing one that fits better for dealing with complex structures like trees.
So it is AQL - ArangoDB Query Language. https://docs.arangodb.com/3.3/Manual/
If you have any suggestions on how to improve the language, you are very welcome to share them.
How about using an existing language, like Python? You can make a really great DSL using Python, and then people have access to all the other Python language features they already know, the stdlib they already know, and the 3rd-party modules they already know.
I could, if I knew Python pretty well :)
But I've done it in the way I needed it to be done.
I wanted to have an isolated and safe environment that would allow me to easily scrape the web without dealing with infrastructural code.
I think this could go further in terms of making it declarative.
A simple declarative approach could be taking this:
LET google = DOCUMENT("https://www.google.com/", true)
and instead of thinking about it as an action (get this page), think about it as giving you an object. The result is a tuple of the URL, the time fetched, and maybe other information (like User-Agent). This helps with exploratory scraping, where you want to be able to repeat actions without always re-fetching the documents. And you'll be constructing a program, unlike a REPL where you always write the program top-to-bottom, including all your intermediate bugs.
Changing DOCUMENT() is easy enough. Things like CLICK() are a bit harder, though if you extend the data structures you can have a document that is the result of clicking a certain element in a certain previous document. Again to do it the first time you have to actually DO the action, but later on perhaps not. And you'll be constructing interstitial objects that are great for debugging.
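A rough Go sketch of that idea (the types are mine, not Ferret's): a document is a value describing how it was obtained, and an action like a click just produces a new derived value; an engine can then execute, cache, or replay the plan.

    package main

    import "time"

    // Illustrative types only; nothing here is part of Ferret.

    // Fetch records how a document was (or would be) obtained.
    type Fetch struct {
        URL       string
        UserAgent string
        FetchedAt time.Time
    }

    // Action derives a new document state from a previous one,
    // e.g. clicking a selector or typing into an input.
    type Action struct {
        Kind     string // "click", "input", ...
        Selector string
        Value    string
    }

    // Document is a value, not a live browser tab: either a plain fetch,
    // or a parent document plus the action applied to it.
    type Document struct {
        Fetch  *Fetch
        Parent *Document
        Action *Action
    }

    // Click returns a new plan node; nothing actually happens until an
    // engine materializes it, which is what makes caching and replay easy.
    func (d *Document) Click(selector string) *Document {
        return &Document{Parent: d, Action: &Action{Kind: "click", Selector: selector}}
    }

    func main() {
        page := &Document{Fetch: &Fetch{URL: "https://www.google.com/"}}
        _ = page.Click(`input[name="btnK"]`) // builds a plan node, performs nothing
    }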
Then what could make it feel really declarative is having more than one presentation of an execution. You can package up a scraping, and then you can answer questions about WHY you ended up with certain results.
Well, that's what I'm saying... right now, making it represent an open browser tab with a specific state and where everything DOES something isn't declarative. But it could be declarative if you changed how those commands are implemented.
Or, to phrase it another way: if the program represents a PLAN then it's declarative. If it represents a series of things to DO then it's imperative. It seems like it's doing things, but it could plan things with the same syntax.
Oh yes. The reason for this is that, for now, the language itself is DOM-agnostic; it's just a port of an existing one (https://docs.arangodb.com/3.4/AQL/).
So the entire DOM thing is implemented by the standard library, which is pluggable.
In the future, I might extend the language to make it less DOM-agnostic by introducing new keywords for dealing with that. But for now you have to move the document object around, which is not that bad, because you may open as many pages as you want in a single query.
I don't see why web scraping should be declarative at all. XPath is declarative, and hard to use. When a human browses a website, they do one thing at a time. That is inherently imperative. A DSL for highly-imperative "human-style" web scraping is a nice idea in my opinion. That's exactly what Ferret appears to be.
In late 2000 at the tail end of the bubble bursting, there was a search engine company trying to build on a platform of push notifications instead of crawling.
I know that at my current company the plurality of traffic comes from crawlers. We don't want to throttle them because that's biting the hand that feeds us, but it sure sucks.
I often wonder how many crawlers you need before it's cheaper for a website to volunteer the information when pages change or new ones arrive.
There are N existing standards for websites to notify crawlers about new content. Few websites use them, and the ones that do are often buggy. A few are even malicious.
Parsing the HTML or traversing the DOM is the easy part. Doing request queues, IP rotation, data quality management, exponential backoff, etc. at scale is much harder.
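As a small illustration of just one of those pieces, a retry loop with exponential backoff and jitter in Go might look like this (queues, per-host rate limits, and proxy rotation are separate layers on top):

    package main

    import (
        "fmt"
        "log"
        "math/rand"
        "net/http"
        "time"
    )

    // fetchWithBackoff retries a GET with exponential backoff and jitter.
    // Real crawlers layer queues, per-host rate limits, and proxy rotation
    // around something like this.
    func fetchWithBackoff(url string, maxRetries int) (*http.Response, error) {
        delay := time.Second
        var lastErr error

        for attempt := 0; attempt <= maxRetries; attempt++ {
            resp, err := http.Get(url)
            if err == nil && resp.StatusCode < 500 && resp.StatusCode != 429 {
                return resp, nil
            }
            if err == nil {
                resp.Body.Close()
                lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
            } else {
                lastErr = err
            }

            // Sleep for the current delay plus random jitter, then double it.
            jitter := time.Duration(rand.Int63n(int64(delay) / 2))
            time.Sleep(delay + jitter)
            delay *= 2
        }
        return nil, lastErr
    }

    func main() {
        resp, err := fetchWithBackoff("https://example.com/", 4)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }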
If you're looking for a slightly-more-native way to declaratively scrape (the data-binding aspect, at least) in Go based on CSS selectors, I wrote a wrapper around GoQuery that just uses struct tags for mapping. You don't have to learn a new language, and it should feel familiar if you've ever written CSS or jQuery. I've found it helps reduce a lot of boilerplate and makes things a lot easier to come back and read than what was previously a lot of nested GoQuery functions, selectors, etc.
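I don't know the exact tag format that wrapper uses, but the general struct-tag idea looks something like this toy version; the `selector` tag and the `unmarshalOne` helper are made up for illustration, only goquery itself is real:

    package main

    import (
        "fmt"
        "log"
        "reflect"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    // Book maps CSS selectors to fields via a (made-up) struct tag.
    type Book struct {
        Title string `selector:"h2.title"`
        Price string `selector:".price"`
    }

    // unmarshalOne fills string fields of dst from the first match of each
    // field's "selector" tag within sel.
    func unmarshalOne(sel *goquery.Selection, dst interface{}) {
        v := reflect.ValueOf(dst).Elem()
        t := v.Type()
        for i := 0; i < t.NumField(); i++ {
            css := t.Field(i).Tag.Get("selector")
            if css == "" || v.Field(i).Kind() != reflect.String {
                continue
            }
            v.Field(i).SetString(strings.TrimSpace(sel.Find(css).First().Text()))
        }
    }

    func main() {
        html := `<div class="book"><h2 class="title">The Go Programming Language</h2>
                 <span class="price">$35</span></div>`
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatal(err)
        }

        var b Book
        unmarshalOne(doc.Find(".book").First(), &b)
        fmt.Printf("%+v\n", b)
    }

The appeal is exactly what the comment describes: the selectors live next to the data model, so the nested Find/Each boilerplate disappears.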
Web scraping "at scale" ends up being a lot more complicated than blinding firing HTTP requests. Scrapy, for example, supports arbitrary "middleware" that can, for example, follow 301 Redirect, respect robots.txt files, follow sitemap.xml files, etc.
To what extent is this supported (or, to what extent do you plan to support it?) Similarly, since the front-end language is essentially a compiler, would it be possible to write an alternative "backend" (e.g. something that distributes requests across a cluster)?
This package is more like a runtime. There are plans to create a dedicated server where you would be able to store your queries, schedule them, and set up output streams to systems like Spark or Flink.
For now, it does not respect robots.txt. But it can be easily added.
Out of the box, there is no scaling mechanism yet, since the project is WIP. But it's written in Go, which makes it pretty fast.
One idea for how you could scale it is to run a cluster of headless Chrome instances, put a proxy/load balancer in front of it, and give Ferret a URL to the cluster. It will treat it as a single instance of Chrome. The only problem is that you would need to differentiate requests from the CDP (Chrome DevTools Protocol) client and, once a page is open, redirect all related requests to the same Chrome instance.
Some sites are so broken that the only way to parse some of their HTML is to use regexps or substring searches on the source code. Maybe this tool can handle that through extensions.
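For completeness, the regexp fallback is trivial with Go's standard library; the markup and pattern below are contrived:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // Broken markup that a strict HTML parser may choke on or mangle.
        src := `<div class=price>EUR 12,99<div><span class=price>EUR 7,49`

        // Pull out whatever follows the attribute, regardless of the tag soup.
        re := regexp.MustCompile(`class=price>([^<]+)`)
        for _, m := range re.FindAllStringSubmatch(src, -1) {
            fmt.Println(m[1])
        }
    }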
One of the goals of the project is to hide the technical details and complexities that come with modern web scraping, especially when you deal with dynamic web pages.
Yes, Puppeteer and Ferret use the same technology under the hood: the Chrome DevTools Protocol.
But the purpose of the project is not to "be better than". The purpose is to let developers focus on the data itself, without dealing with the technical details of the underlying technologies.
You can treat it as a higher-level abstraction on top of Puppeteer/CDP.
It does not really matter whether it's written in JS or Go. The main goal is to simplify data scraping and make it more declarative (even though some people say it's not a declarative language :))
It's not so interesting to me that people continue to feel the need to build scraping frameworks (it's an excellent beginner project because it encompasses so much of web development); what's interesting is why Hacker News finds scraping frameworks so interesting. There seems to be one at the top of HN every week or two.
Personal theory, as someone who has done a good bit of scraping over the last decade:
Extremely common problem space, with a lot of tantalizing opportunities for a "platform" or "shared language" where one doesn't exist. I see existing tooling like BeautifulSoup, Scrapy, and Selenium as isomorphic to the "near hardware level" of our problem-space tool stack, whereas the abstractions and higher-level logic are often defined per use case.
On top of (or perhaps because of) that, one often writes a lot of boilerplate, but when it comes time to genericize, one often ends up writing a tool that genericizes only within their own problem space; and for all the tantalizing opportunities, the number of "not quite fully intersecting scraping problem spaces" (and associated tradeoffs/different paradigms) is far more massive than I considered when I started any of my own scraping tools.
This has led me to take a very opinionated view with my own tooling, wherein I build for _ONE SPECIFIC RECURRING SCRAPING PRIMITIVE_ (in my case, treating the whole world as a stream of events. You want something that can be more or less massaged into that? Cool, maybe something for you. If not? Probably want to look at a different set of tools).
I think the comments here are unfairly harsh. I really like this innovative idea of having a dedicated language. If it can see client-side rendered HTML (e.g. React, Vue, etc.), that would be a whole other level for me, since I don't think this has been done before.
I feel the same here. Architecturally, I've never liked scrapers that are tightly coupled to a service. Websites are prone to sudden change, and it seems unsustainable to redeploy each time a selector changes; it's pretty innovative to leverage the power of *QL for this sort of problem.
For many years now, the best way to scrape has been to use a headless browser, for example PhantomJS with Node.js. That, in combination with Tor or a large proxy pool, is unbeatable by all other alternatives.
I'm not sure why you're being downvoted. As far as I can tell from the examples, there is nothing that this language brings to the table that couldn't be implemented instead as an API on top of an existing language.
The "funny" part is, while reading this, I kept thinking of a similar tool I built myself as a wrapper around python+beautifulsoup. Definitely parts of what the OP needed were compelling to me (I found regular subpatterns in my scraping work that could be encapsulated really well in certain bits of syntactic sugar, but concluded that a minimal json-like structure for defining a scrape was both sufficient and let me have a graphical "scrape builder" in a UX far more readily than if I actually wrote the scrape as code.
There's the usual amount of HN cynicism in the thread, which I'm not sure is 200% off the mark, but I think there are some good concepts in "scraping primitives" worth contemplating that the OP took an interesting angle on (or rather, not the angle I took, so it's interesting to me).
If other people would find it beneficial, it can be; I had admittedly seen it as "just a mess of sugar to help me scrape easier" (with all the typical nervousness of showing others "imperfect" code). I'm in the process of cleaning it up, writing tests, and finishing an MVP UI. I can do a Show HN in a few weeks once I've found the time to get it ship-shape.
Do you have a more involved example where Ferret really shines, as opposed to a library with a similar API in JS or another common language? I really don't mean to be negative, but I just don't see how Ferret is any easier to use than something like Nightmare[0]. That said, I'm wondering if it's an issue of communication more than anything, so maybe a different example than the one in the readme would help.
You are fine, I totally understand your scepticism. And you are right, there are definitely issues in communication.
First of all, I built it for myself. I needed a high-level representation of scraping logic that would run in an isolated and safe environment.
Second, I needed to be able to easily scrape dynamic pages.
So, what I got is:
- A high-level, declarative-ish language that hides all infrastructural details and helps you focus on the logic itself. It lets you describe what you want without worrying about the underlying technology. Today I'm using headless Chrome; tomorrow I may use something else, but the change should not affect your code.
- Full support for dynamic pages. You can get data from dynamically rendered pages, emulate user actions, etc. Heck, you can even write bots with it.
- Embeddable. For now there is only a CLI, but there are plans to write a web server where you can save your scripts, schedule them, and set up output streams.
But the main idea is to provide a high-level, declarative way of scraping the web. I'm not saying you can't do that with other tools. I'm just trying to come up with something easier to work with.
Regarding examples, the project is still WIP, so as I add more complex features, I'll add more complex examples.
Here is a more or less complex one, getting data from Google Search. It's not that difficult, but it showcases the core feature: working with dynamic pages.
The idea is to create a high level abstraction that represents your web scraping logic.
The project is still WIP. I will create a web server which will help you store your queries, schedule them, and set up output streams to other systems like Spark and Flink.
Javascript isn't in uppercase, therefore it is an inferior language to write queries in. The best, most concise solution here is to reinvent javascript in uppercase, then pass it off as a new QL.
The main purpose is to use scripts like SQL, where you can write and modify your data scripts without compilation.
Plus, the project aims to simplify the process and hide the technical complexity behind it.
Moreover, don't forget that the system can work with dynamic pages, which brings more complexity underneath.
And finally, you can use it as a library. It's totally embeddable.
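A sketch of what embedding from Go looks like, based on the usage shown in the project README; the exact package path and signatures are assumptions and may differ from the current release:

    package main

    import (
        "context"
        "fmt"
        "log"

        // Package path as shown in the Ferret README; treat as an assumption.
        "github.com/MontFerret/ferret/pkg/compiler"
    )

    func main() {
        // A small FQL query; DOCUMENT/ELEMENT come from Ferret's stdlib.
        query := `
            LET doc = DOCUMENT("https://example.com/")
            RETURN ELEMENT(doc, "h1")
        `

        comp := compiler.New()
        program, err := comp.Compile(query)
        if err != nil {
            log.Fatal(err)
        }

        // Per the README, Run returns the query result serialized as JSON.
        out, err := program.Run(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(out))
    }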