We were discussing integrating Polar web archives along with ArchiveBox and maybe having some sort of standard to automatically submit these WARCs to the Internet Archive as part of your normal browsing activity.
Polar (https://getpolarized.io/) has a similar web capture feature, but it's not WARC (yet).
WARC is probably the easiest standard for Polar to adopt. Right now we use HTML encoded in JSON objects.
When the user captures a web page, we save all resources and store them in a PHZ file, which you can keep as your own personal web archive.
What I'd like to eventually do is update our extension to auto-capture web pages so you could use Polar's cloud storage feature to basically store every page you've ever visited.
It really wouldn't be that much money per year. I did the math and it's about $50 per year to store your entire web history.
If I can get Polar over to WARC, that would mean tools like ArchiveBox and Polar could interop, and we could do things like automatically send the documents you browse to the Internet Archive.
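For what it's worth, the WARC side of the conversion shouldn't be the hard part. Here's a rough sketch of writing a captured page out as a WARC response record with the warcio Python library; the function name, URL, HTML bytes, and headers are placeholders for illustration, not our actual PHZ fields:

    # Sketch only: write one captured page as a WARC response record (pip install warcio).
    from io import BytesIO

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    def write_capture_as_warc(url, html_bytes, out_path):
        with open(out_path, 'wb') as output:
            writer = WARCWriter(output, gzip=True)
            http_headers = StatusAndHeaders(
                '200 OK',
                [('Content-Type', 'text/html; charset=utf-8')],
                protocol='HTTP/1.1')
            record = writer.create_warc_record(
                url, 'response',
                payload=BytesIO(html_bytes),
                http_headers=http_headers)
            writer.write_record(record)

    # Example usage with placeholder data:
    write_capture_as_warc('https://example.com/', b'<html>...</html>', 'capture.warc.gz')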
There's one huge problem though: what do we do about cookies and private data? I'm really not sure what to do there. It might be possible to strip this data for certain sites (news) without any risk of violating the user's privacy.
Why don't they store everything as plain documents in a zip file and keep metadata in a JSON file? Seems more future-proof and easier for users to manipulate than WARC.
Have you looked at the WARC format? It's ridiculously simple, basically concatenated raw HTTP requests and responses, with some extra HTTP metadata headers mixed in (a la extra JSON metadata keys). You can open it with a text editor. Very simple and efficient to manipulate, and very efficient to iterate over or generate.
Arguably the biggest problem is that it isn't complex enough: there is no index of contents built-in (the standard .csv-like index format of URL/timestamp/hash/offset is called CDX).
There aren't a ton of tools in the web archiving space in general, but almost all of the ones that do exist work with WARC. Existing tools (for interchange) include bulk indexing (for search, graph analysis, etc) and "replay" via web interface or browser-like application. Apart from specific centralized web archiving services that use WARC, there are several large public datasets, like Common Crawl, that are released in WARC format.
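To make the "simple to iterate over" point concrete, here's a minimal sketch using the warcio Python library (the file name is just a placeholder):

    # Iterate over every response record in a WARC file (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                body = record.content_stream().read()
                print(uri, len(body))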
Yeah, the fact that I have to convert them after is just an unnecessary extra step for me, so I'll stick to wget and httrack for my archives. Once I mirror them I can just copy the files anywhere and browse them on any browser/device.
This is what Polar does, but WARC is standardized... I think the argument against WARC, though, is that there isn't really much interchange. The point of standards is interchange, IMO.
I am always inclined to use standards, because of the "right tool for the job" mentality, and it seems like an even more obvious thing to do when the standard is widely adopted already. But let's be honest: there are lots of really crappy standards in use today because at some point they were widely adopted, and now replacing them with something sane seems like too big a job to pull off. It all depends on adoption by the end users, and the end users don't care about the formats; they care about the neat tools that make life easier.
If Polar gets popular enough, it can potentially dictate what the standard is. So I think it is best to compare the formats from a purely technical perspective. ZIP + JSON with as simple an inner structure as possible has a huge advantage of being easy to handle by anyone: I can open (and actually read/modify) such a file with tools I will have on any machine, anytime. It is so simple and obvious that I can write a script that packs some data in a format probably (at least partially) readable by your software in a minute (see the sketch at the end of this comment).
So:
1. Can I (meaningfully) do the same with this WARC thing?
2. What are the technical benefits of WARC over this unnamed (yet) file format?
3. Can WARC be losslessly converted into *.polar? And the other way around?
4. Are there tools to do #3? Is it tricky to implement on a new platform?
I mean, if you can actually propose something better than WARC (or whatever) you can potentially save the world from another WebDAV (or name your favorite horrible standard which we cannot get rid of because everyone uses it).
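To make the "script in a minute" point concrete, here's roughly what I have in mind; the layout (one zip per capture, resources as plain files, a single metadata.json) is invented for illustration and is not Polar's actual format:

    # Sketch of a ZIP + JSON capture format; layout and function name are made up.
    import json
    import zipfile

    def pack_capture(out_path, url, html, resources):
        """resources: dict mapping relative file names to bytes."""
        with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            zf.writestr('index.html', html)
            for name, data in resources.items():
                zf.writestr('resources/' + name, data)
            zf.writestr('metadata.json', json.dumps({
                'url': url,
                'resources': sorted(resources),
            }, indent=2))

    pack_capture('capture.zip', 'https://example.com/',
                 '<html>...</html>', {'style.css': b'body { }'})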
I agree with the thrust of this and suggest taking it further by using SQLite for the underlying file format.
The advantages of SQLite are too numerous to list, and it has a built-in compressed format so there's no bloat to worry about vs. ZIP files.
As an example, this should make it practical, given a few iterations, to store multiple snapshots as deltas, deduplicating identical content. It also obviates having to base64 encode images and other binary assets.
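As a rough sketch of the deduplication idea (the schema here is invented purely for illustration, not any existing tool's format): identical assets shared between snapshots are stored once, keyed by their hash, and binary data goes in as raw BLOBs with no base64 step.

    # Content-addressed storage in SQLite: duplicate assets are stored only once.
    import hashlib
    import sqlite3

    db = sqlite3.connect('archive.db')
    db.executescript("""
        CREATE TABLE IF NOT EXISTS blobs (
            sha256 TEXT PRIMARY KEY,
            data   BLOB NOT NULL
        );
        CREATE TABLE IF NOT EXISTS snapshot_files (
            snapshot_id INTEGER NOT NULL,
            path        TEXT NOT NULL,
            sha256      TEXT NOT NULL REFERENCES blobs(sha256),
            PRIMARY KEY (snapshot_id, path)
        );
    """)

    def add_file(snapshot_id, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Re-inserting identical content is a no-op, so snapshots share blobs.
        db.execute('INSERT OR IGNORE INTO blobs VALUES (?, ?)', (digest, data))
        db.execute('INSERT OR REPLACE INTO snapshot_files VALUES (?, ?, ?)',
                   (snapshot_id, path, digest))
        db.commit()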
One disadvantage is it introduces an application dependency. If the purpose of an archive is to preserve the data for future recall, years or decades later, having it in the most accessible format possible would be a priority.
A real problem with application dependency is that it makes it extremely hard to create a completely independent alternative. A good protocol should have multiple implementations with nothing in common; otherwise you rely on specific implementation details. The infamous case of WebSQL should remind us of that.
I have spent the last few hours reading up on everything WARC that I could find, but I still haven't been able to answer my main question: why is this only done with external crawlers?
There does not seem to be a tool to actually capture a WARC directly in your own browser session. Webrecorder (http://webrecorder.io/) is the only example I could find that comes close in terms of user experience, but it still requires a third party and different browsing habits.
- are there browser extensions that can save a warc while you browse?
- are there API limitations that require external browser control? something browser extensions can't be used for?
- or is it simply a question of use case. And crawlers are more popular (for archiving) than locally recorded browser history (for search/analytics)?
edit:
I have now found https://github.com/machawk1/warcreate, related discussions in issues #111 and #112 are quite interesting. Looks like there are some serious limitations for browser extensions. I will look deeper into how webrecorder works and how this could be combined
When I initially coded up WARCreate, the webRequest API was still experimental. I believe there are more mature APIs that can be used from the extension context but some require DevTools to be visually open, which is not a common usage pattern of a typical web user.
Web archiving from a browser extension is difficult but can be improved. I don't know of any other approaches at trying to do this via a browser extension beyond submitting a URI to an archive.
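The "submit a URI to an archive" fallback is at least easy to script. A rough sketch against the Wayback Machine's Save Page Now endpoint (no error handling or rate limiting, and reading the archived ___location from the Content-Location header is a best guess rather than a guarantee):

    # Ask the Wayback Machine to archive a URL via its Save Page Now endpoint.
    import requests

    def save_to_wayback(url):
        resp = requests.get('https://web.archive.org/save/' + url, timeout=120)
        resp.raise_for_status()
        # The archived snapshot's path is often reported in Content-Location.
        return resp.headers.get('Content-Location')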
This looks awesome. Is there an easy way to transfer my current Chrome session/cookies to the ArchiveBox Chromium instance? I would love for my subscription websites (e.g. nytimes) to register my logged-in state and allow the capturing instance full logged-in access.
This sounds like a very good idea, but I'm having trouble making it work. For example, let's say I want to save a great website which will probably disappear soon (https://launchaco.com). I run `echo https://launchaco.com | ./archive` and then...? The generated index.html doesn't load css and js files. Or is this more for static content?
Is there some tool that would allow one to make a copy of a modern SPA? Is that even possible?
EDIT: I'm sad to see launchaco.com go, it would be a perfect tool for a project I'm working on. I don't mind paying, but I gather this is not possible anymore, and anyway, it might take some time for me to have everything ready...
> The generated index.html doesn't load css and js files. Or is this more for static content?
Why doesn't it do that? I thought that was the point of using Chrome as a headless browser: to load all the dynamic elements into a final DOM so they could then be captured & serialized out: "ArchiveBox works by rendering the pages in a headless browser, then saving all the requests and fully loaded pages in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the original content disappears off the internet."
It does that, it's possible it just broke on the page he tried. It doesn't work perfectly on 100% of pages which is why it saves as PDF, screenshot, and other methods as a fallback.
ArchiveBox is going to save all requests and responses with pywb in proxy mode, so it includes any data from the database that was needed to render the page.
Consider grab-site (https://github.com/ArchiveTeam/grab-site) in the future. There is a difference between fetching/mirroring content and recording the HTTP request and response headers when retrieving all objects for a site.
I believe there is even a docker container for a quick pull and run.
When archiving, you want the request and response headers and the content. Grab-site does that (any tool really that'll write WARC files). Sorry if my comment was ambiguous in that regard.
We intend to store both the WARC of the headless browser render and the wget clone. (I'm the creator, @pirate on GitHub.) The idea is to archive anything in many redundant formats (HTML clone, WARC, PDF, etc.) and for people to run this on a deduplicated filesystem like ZFS/BTRFS or disable specific methods if they care about saving space.
That looks pretty nice! I currently have Wallabag running on a server (using it with a browser plugin to save pages) and it works pretty well. While Wallabag is strictly for websites as far as I understand, this seems to support more data (like bookmarks, history, etc.), which is great. I will certainly give it a try; luckily, import from Wallabag is supported too.
That is great. I am currently archiving my Pocket archive. Some time ago I considered running my own Wallabag instance, but it was too much work and maintenance. ArchiveBox worked locally right away.
Unfortunately, it seems that for some sites it is archiving the GDPR warning or bot captcha instead of the content. For example, Tumblr only gets me "Before you continue, an update from us" with an 'accept' button that does nothing; for Fast Company the PDF output is covered by a half-page popup on each page; Ouest-France gives me an "I am not a Robot" captcha...
Those issues are probably linked to the fact that I do not use Chromium, though it is installed. Once I understand how to archive the masked content, I intend to look at each screenshot to detect problems and re-archive those pages.
1. It used no ad blocking, hence the bot happily downloads all trackers and advertisements! Also, trackers know my Pocket bookmarks now...
2. Websites are ugly. In a 1440x2000 screenshot of a National Geographic article, the top 1440x1500 pixels consist of black with just a grey circular progress indicator in the middle and the word "ADVERTISEMENT" at the top.
3. Websites are wasteful. Many WARC archives clock in at 30 to 50 MB. Without downloading media, a single (article) web page is 30 MB!
4. While consulting the archive, my ad blocker allows everything since the initial page is loaded over the file:// protocol. It seems ArchiveBox does not rewrite the URLs in its HTML output (as wget's convert-links option would do), so trackers and ads galore when opening output.html and even archive/<timestamp>/index.html, which automatically load the other files.
I was poking through the documentation - how does ArchiveBox generate the WARC file? I see the web page is archived in HTML, PNG and PDF using Chrome, but I don't think Chrome natively has the ability to create WARC files, does it?
Has anyone heard any news on Mozilla open sourcing Pocket? There are pieces that have been released, but ultimately I would like to self-host a Pocket archive similar to Wallabag.
> Has anyone heard any news on Mozilla open sourcing Pocket?
All of Pocket? It's fairly straightforward to extract your Pocket data through their API (their export functionality leaves much to be desired unfortunately).
I love Mozilla and I'd support them open sourcing it. But I think they probably don't have much motivation to do so, as there are plenty of competing archiving tools already in the public ___domain, and their value-add as a paid archiving service would be reduced if anyone could run a white-label version of Pocket with one click.
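For anyone who wants to do the extraction themselves, here's a rough sketch against Pocket's v3 /get endpoint; the consumer key and access token are placeholders you'd obtain from their developer portal, and the function name is made up:

    # Pull all saved items from Pocket's v3 API.
    import requests

    def export_pocket(consumer_key, access_token):
        resp = requests.post(
            'https://getpocket.com/v3/get',
            json={
                'consumer_key': consumer_key,
                'access_token': access_token,
                'state': 'all',
                'detailType': 'simple',
            },
            headers={'X-Accept': 'application/json'},
            timeout=60)
        resp.raise_for_status()
        # Items come back as a dict keyed by item ID.
        return resp.json().get('list', {})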
There is an open issue on GitHub to add an API. Once that feature is implemented, adding a web page where links can be submitted for archival becomes trivial.
The short answer is that it can (usually) archive SPAs successfully. The long answer is that the wget archive method will download the JS, and in theory it should execute in the archived version just like it does on the normal site, but in practice it doesn't always work 100%. Luckily that's why ArchiveBox also saves a Chrome-rendered version of the page as a PDF, screenshot, and DOM dump, so you should be able to archive most JavaScript-heavy content without too many problems.
Pip, apt, & Homebrew packages are the install methods we're moving towards currently. I just caved to user demand in the short term and added a helper script about a year ago as a crutch until we finish releasing the real packages.
All browsers should aggressively cache everything they get and bypass all "anti-cache" inanities. There are addons that modify response headers but this is all just filthycasulness, everything is hostile to the user--just as it should be. If people do not get exsanguinated violently enough they feel really antsy.

All data is to be immediately uploaded to Tor+IPFS. The browser is meant to "play coy" and act as though it did respect the headers in online mode, but shift to offline and it loads the entire history fully. This would work well with Tor Browser. Same thing should apply for all videos of course.

To use such tools would break the anonymity set of people, which is why after each TBB close the entire cache2 folder should be exported to a file that can then be imported offline-only. A simple way to do this would be to simply copy the entire tor-browser directory and then reopen it in work-offline. The problem is the anti-cache websites are then lost, so just like JavaScript, refuse to use websites that use anti-cache mechanisms. The user is not supposed to feel like everything can get annihilated at any time: this level of hostility towards her will not go unpunished.
Generally whoever takes anything down harms the Noosphere and should be viewed as an enemy of Posthumanism. Steaks on the table by choice and consent--treat them cruelly and without mercy. The Sibyl System will show no mercy on those who have ever forced users to enable JavaScript or prevented IA from archiving their pages.
ArchiveBox uses WARC as its backing store:
https://en.wikipedia.org/wiki/Web_ARChive
which is nice because it's standardized.