We were discussing integrating Polar web archives along with ArchiveBox and maybe having some sort of standard to automatically submit these WARCs to the Internet Archive as part of your normal browsing activity.
Polar (https://getpolarized.io/) has a similar web capture feature, but it's not WARC (yet).
WARC is probably the easiest standard for Polar to adopt. Right now we use HTML encoded in JSON objects.
When the user captures a web page, we save all resources and store them in a PHZ file, which you can keep as your own personal web archive.
What I'd like to eventually do is update our extension to auto-capture web pages so you could use Polar's cloud storage feature to basically store every page you've ever visited.
It really wouldn't be that much money per year. I did the math and it's about $50 per year to store your entire web history.
If I can get Polar over to WARC, that would mean tools like ArchiveBox and Polar could interop, and we could do things like automatically send the documents you browse to the Internet Archive.
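For what it's worth, the WARC side of the conversion shouldn't be the hard part. Here's a rough sketch of writing a captured page out as a WARC response record with the warcio Python library; the function name, URL, HTML bytes, and headers are placeholders for illustration, not our actual PHZ fields:

    # Sketch only: write one captured page as a WARC response record (pip install warcio).
    from io import BytesIO

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    def write_capture_as_warc(url, html_bytes, out_path):
        with open(out_path, 'wb') as output:
            writer = WARCWriter(output, gzip=True)
            http_headers = StatusAndHeaders(
                '200 OK',
                [('Content-Type', 'text/html; charset=utf-8')],
                protocol='HTTP/1.1')
            record = writer.create_warc_record(
                url, 'response',
                payload=BytesIO(html_bytes),
                http_headers=http_headers)
            writer.write_record(record)

    # Example usage with placeholder data:
    write_capture_as_warc('https://example.com/', b'<html>...</html>', 'capture.warc.gz')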
There's one huge problem though: what do we do about cookies and private data? I'm really not sure what to do there. It might be possible to strip this data for certain sites (news) without any risk of violating the user's privacy.
Why don't they store everything as plain documents in a zip file and keep metadata in a JSON file? Seems more future-proof and easier for users to manipulate than WARC.
Have you looked at the WARC format? It's ridiculously simple, basically concatenated raw HTTP requests and responses, with some extra HTTP metadata headers mixed in (a la extra JSON metadata keys). You can open it with a text editor. Very simple and efficient to manipulate, and very efficient to iterate over or generate.
Arguably the biggest problem is that it isn't complex enough: there is no index of contents built-in (the standard .csv-like index format of URL/timestamp/hash/offset is called CDX).
There aren't a ton of tools in the web archiving space in general, but almost all of the ones that do exist work with WARC. Existing tools (for interchange) include bulk indexing (for search, graph analysis, etc) and "replay" via web interface or browser-like application. Apart from specific centralized web archiving services that use WARC, there are several large public datasets, like Common Crawl, that are released in WARC format.
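To make the "simple to iterate over" point concrete, here's a minimal sketch using the warcio Python library (the file name is just a placeholder):

    # Iterate over every response record in a WARC file (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                body = record.content_stream().read()
                print(uri, len(body))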
Yeah, the fact that I have to convert them after is just an unnecessary extra step for me, so I'll stick to wget and httrack for my archives. Once I mirror them I can just copy the files anywhere and browse them on any browser/device.
This is what Polar does, but WARC is standardized... I think the argument against WARC, though, is that there isn't really much interchange. The point of standards is interchange, IMO.
I am always inclined to use standards, because of the "right tool for the job" mentality, and it seems like an even more obvious thing to do when the standard is widely adopted already. But let's be honest: there are lots of really crappy standards in use today because at some point they were widely adopted, and now replacing them with something sane seems like too big a job to pull off. It all depends on adoption by the end users, and the end users don't care about the formats; they care about the neat tools that make life easier.
If Polar gets popular enough, it can potentially dictate what the standard is. So I think it is best to compare the formats from a purely technical perspective. ZIP + JSON with as simple an inner structure as possible has a huge advantage of being easy to handle by anyone: I can open (and actually read/modify) such a file with tools I will have on any machine, anytime. It is so simple and obvious that I can write a script that packs some data in a format probably (at least partially) readable by your software in a minute (see the sketch at the end of this comment).
So:
1. Can I (meaningfully) do the same with this WARC thing?
2. What are the technical benefits of WARC over this unnamed (yet) file format?
3. Can WARC be losslessly converted into *.polar? And the other way around?
4. Are there tools to do #3? Is it tricky to implement on a new platform?
I mean, if you can actually propose something better than WARC (or whatever) you can potentially save the world from another WebDAV (or name your favorite horrible standard which we cannot get rid of because everyone uses it).
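To make the "script in a minute" point concrete, here's roughly what I have in mind; the layout (one zip per capture, resources as plain files, a single metadata.json) is invented for illustration and is not Polar's actual format:

    # Sketch of a ZIP + JSON capture format; layout and function name are made up.
    import json
    import zipfile

    def pack_capture(out_path, url, html, resources):
        """resources: dict mapping relative file names to bytes."""
        with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            zf.writestr('index.html', html)
            for name, data in resources.items():
                zf.writestr('resources/' + name, data)
            zf.writestr('metadata.json', json.dumps({
                'url': url,
                'resources': sorted(resources),
            }, indent=2))

    pack_capture('capture.zip', 'https://example.com/',
                 '<html>...</html>', {'style.css': b'body { }'})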
I agree with the thrust of this and suggest taking it further by using SQLite for the underlying file format.
The advantages of SQLite are too numerous to list, and it has a built-in compressed format so there's no bloat to worry about vs. ZIP files.
As an example, this should make it practical, given a few iterations, to store multiple snapshots as deltas, deduplicating identical content. It also obviates having to base64 encode images and other binary assets.
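As a rough sketch of the deduplication idea (the schema here is invented purely for illustration, not any existing tool's format): identical assets shared between snapshots are stored once, keyed by their hash, and binary data goes in as raw BLOBs with no base64 step.

    # Content-addressed storage in SQLite: duplicate assets are stored only once.
    import hashlib
    import sqlite3

    db = sqlite3.connect('archive.db')
    db.executescript("""
        CREATE TABLE IF NOT EXISTS blobs (
            sha256 TEXT PRIMARY KEY,
            data   BLOB NOT NULL
        );
        CREATE TABLE IF NOT EXISTS snapshot_files (
            snapshot_id INTEGER NOT NULL,
            path        TEXT NOT NULL,
            sha256      TEXT NOT NULL REFERENCES blobs(sha256),
            PRIMARY KEY (snapshot_id, path)
        );
    """)

    def add_file(snapshot_id, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Re-inserting identical content is a no-op, so snapshots share blobs.
        db.execute('INSERT OR IGNORE INTO blobs VALUES (?, ?)', (digest, data))
        db.execute('INSERT OR REPLACE INTO snapshot_files VALUES (?, ?, ?)',
                   (snapshot_id, path, digest))
        db.commit()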
One disadvantage is it introduces an application dependency. If the purpose of an archive is to preserve the data for future recall, years or decades later, having it in the most accessible format possible would be a priority.
A real problem with application dependency is that it makes it extremely hard to create a completely independent alternative. A good protocol should have multiple implementations with nothing in common; otherwise you rely on specific implementation details. The infamous case of WebSQL should remind us of that.
I have spent the last few hours reading up on everything WARC that I could find, but I still haven't been able to answer my main question: why is this only done with external crawlers?
There does not seem to be a tool to actually capture a WARC directly in your own browser session. Webrecorder (http://webrecorder.io/) is the only example I could find that comes close in terms of user experience, but it still requires a third party and different browsing habits.
- are there browser extensions that can save a warc while you browse?
- are there API limitations that require external browser control? something browser extensions can't be used for?
- or is it simply a question of use case. And crawlers are more popular (for archiving) than locally recorded browser history (for search/analytics)?
edit:
I have now found https://github.com/machawk1/warcreate, related discussions in issues #111 and #112 are quite interesting. Looks like there are some serious limitations for browser extensions. I will look deeper into how webrecorder works and how this could be combined
When I initially coded up WARCreate, the webRequest API was still experimental. I believe there are more mature APIs that can be used from the extension context but some require DevTools to be visually open, which is not a common usage pattern of a typical web user.
Web archiving from a browser extension is difficult but can be improved. I don't know of any other approaches at trying to do this via a browser extension beyond submitting a URI to an archive.
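The "submit a URI to an archive" fallback is at least easy to script. A rough sketch against the Wayback Machine's Save Page Now endpoint (no error handling or rate limiting, and reading the archived ___location from the Content-Location header is a best guess rather than a guarantee):

    # Ask the Wayback Machine to archive a URL via its Save Page Now endpoint.
    import requests

    def save_to_wayback(url):
        resp = requests.get('https://web.archive.org/save/' + url, timeout=120)
        resp.raise_for_status()
        # The archived snapshot's path is often reported in Content-Location.
        return resp.headers.get('Content-Location')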
This looks awesome. Is there an easy way to transfer my current Chrome session/cookies to the ArchiveBox Chromium instance? I would love for my subscription websites (e.g. nytimes) to register my logged-in state and allow the capturing instance full logged-in access.
This sounds like a very good idea, but I'm having trouble making it work. For example, let's say I want to save a great website which will probably disappear soon (https://launchaco.com). I run `echo https://launchaco.com | ./archive` and then...? The generated index.html doesn't load css and js files. Or is this more for static content?
Is there some tool that would allow one to make a copy of a modern SPA? Is that even possible?
EDIT: I'm sad to see launchaco.com go, it would be a perfect tool for a project I'm working on. I don't mind paying, but I gather this is not possible anymore, and anyway, it might take some time for me to have everything ready...
> The generated index.html doesn't load css and js files. Or is this more for static content?
Why doesn't it do that? I thought that was the point of using Chrome as a headless browser: to load all the dynamic elements into a final DOM so they could then be captured & serialized out: "ArchiveBox works by rendering the pages in a headless browser, then saving all the requests and fully loaded pages in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the original content disappears off the internet."
It does that, it's possible it just broke on the page he tried. It doesn't work perfectly on 100% of pages which is why it saves as PDF, screenshot, and other methods as a fallback.
ArchiveBox is going to save all requests and responses with pywb in proxy mode, so it includes any data from the database that was needed to render the page.
Consider grab-site (https://github.com/ArchiveTeam/grab-site) in the future. There is a difference between fetching/mirroring content and recording the HTTP request and response headers when retrieving all objects for a site.
I believe there is even a docker container for a quick pull and run.
When archiving, you want the request and response headers and the content. Grab-site does that (any tool really that'll write WARC files). Sorry if my comment was ambiguous in that regard.
We intend to store both the WARC of the headless browser render and the wget clone. (I'm the creator, @pirate on GitHub.) The idea is to archive anything in many redundant formats (HTML clone, WARC, PDF, etc.) and for people to run this on a deduplicated filesystem like ZFS/BTRFS or disable specific methods if they care about saving space.
That looks pretty nice! I currently have Wallabag running on a server (using it with a browser plugin to save pages) and it works pretty well. While Wallabag is strictly for websites as far as I understand, this seems to support more data (like bookmarks, history, etc.), which is great. I will certainly give it a try; luckily, import from Wallabag is supported too.
That is great. I am currently archiving my Pocket archive. Some time ago I considered running my own Wallabag instance, but it was too much work and maintenance. ArchiveBox worked locally right away.
Unfortunately, it seems that for some sites it is archiving the GDPR warning or bot captcha instead of the content. For example, Tumblr only gets me "Before you continue, an update from us" with an 'accept' button that does nothing; for Fast Company the PDF output is covered by a half-page popup on each page; Ouest-France gives me an "I am not a Robot" captcha...
Those issues are probably linked to the fact that I do not use Chromium, though it is installed. Once I understand how to archive the masked content, I intend to look at each screenshot to detect problems and re-archive those pages.
1. It used no ad blocking, hence the bot happily downloads all trackers and advertisements! Also, trackers know my Pocket bookmarks now...
2. Websites are ugly. In a 1440x2000 screenshot of a National Geographic article, the top 1440x1500 pixels consist of black with just a grey circular progress indicator in the middle and the word "ADVERTISEMENT" at the top.
3. Websites are wasteful. Many WARC archives clock in at 30 to 50 MB. Without downloading media, a single (article) web page is 30 MB!
4. While consulting the archive, my ad blocker allows everything since the initial page is loaded over the file:// protocol. It seems ArchiveBox does not rewrite the URLs in its HTML output (as wget's convert-links option would do), so trackers and ads galore when opening output.html and even archive/<timestamp>/index.html, which automatically load the other files.
I was poking through the documentation - how does ArchiveBox generate the WARC file? I see the web page is archived in HTML, PNG and PDF using Chrome, but I don't think Chrome natively has the ability to create WARC files, does it?
Has anyone heard any news on Mozilla open sourcing Pocket? There are pieces that have been released, but ultimately I would like to self-host a Pocket archive similar to Wallabag.
> Has anyone heard any news on Mozilla open sourcing Pocket?
All of Pocket? It's fairly straightforward to extract your Pocket data through their API (their export functionality leaves much to be desired unfortunately).
I love Mozilla and I'd support them open sourcing it. But I think they probably don't have much motivation to do so, as there are plenty of competing archiving tools already in the public ___domain, and their value-add as a paid archiving service would be reduced if anyone could run a white-label version of Pocket with one click.
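For anyone who wants to do the extraction themselves, here's a rough sketch against Pocket's v3 /get endpoint; the consumer key and access token are placeholders you'd obtain from their developer portal, and the function name is made up:

    # Pull all saved items from Pocket's v3 API.
    import requests

    def export_pocket(consumer_key, access_token):
        resp = requests.post(
            'https://getpocket.com/v3/get',
            json={
                'consumer_key': consumer_key,
                'access_token': access_token,
                'state': 'all',
                'detailType': 'simple',
            },
            headers={'X-Accept': 'application/json'},
            timeout=60)
        resp.raise_for_status()
        # Items come back as a dict keyed by item ID.
        return resp.json().get('list', {})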
There is an open issue on GitHub to add an API. Once that feature is implemented, adding a web page where links can be submitted for archival becomes trivial.
The short answer is that it can (usually) archive SPAs successfully. The long answer is that the wget archive method will download the JS, and in theory it should execute in the archived version just like it does on the normal site, but in practice it doesn't always work 100%. Luckily that's why ArchiveBox also saves a Chrome-rendered version of the page as a PDF, screenshot, and DOM dump, so you should be able to archive most JavaScript-heavy content without too many problems.
Pip, apt, & Homebrew packages are the install methods we're moving towards currently. I just caved to user demand in the short term and added a helper script about a year ago as a crutch until we finish releasing the real packages.
All browsers should aggressively cache everything they get and bypass all "anti-cache" inanities. There are addons that modify response headers but this is all just filthycasulness, everything is hostile to the user--just as it should be. If people do not get exsanguinated violently enough they feel really antsy.

All data is to be immediately uploaded to Tor+IPFS. The browser is meant to "play coy" and act as though it did respect the headers in online mode, but shift to offline and it loads the entire history fully. This would work well with Tor Browser. Same thing should apply for all videos of course.

To use such tools would break the anonymity set of people, which is why after each TBB close the entire cache2 folder should be exported to a file that can then be imported offline-only. A simple way to do this would be to simply copy the entire tor-browser directory and then reopen it in work-offline. The problem is the anti-cache websites are then lost, so just like JavaScript, refuse to use websites that use anti-cache mechanisms. The user is not supposed to feel like everything can get annihilated at any time: this level of hostility towards her will not go unpunished.
Generally whoever takes anything down harms the Noosphere and should be viewed as an enemy of Posthumanism. Steaks on the table by choice and consent--treat them cruelly and without mercy. The Sibyl System will show no mercy on those who have ever forced users to enable JavaScript or prevented IA from archiving their pages.
ArchiveBox uses WARC as its backing store:
https://en.wikipedia.org/wiki/Web_ARChive
which is nice because it's standardized.