4Q: the final archive format (github.com/robey)
56 points by bpierre on April 16, 2015 | 46 comments



Final? Really? What if I need xattrs and POSIX ACLs (or I use Windows and want NT ACLs and streams; or forks under OS X; …)? Hard-coded encryption algorithms also don't seem particularly future-proof.


No forward error correction either. It can't be considered "final" without some form of bitrot protection.

And no, storing it on a ZFS or BTRFS volume with error recovery enabled does not count. (They don't use FEC, they use 1960's triplicate storage. Hugely wasteful of space, does nothing to protect against transmission errors and can still be corrupted by two of the exact wrong bits being damaged.)

Storing it on a medium that does use FEC also does not count. I want a per-file tunable FEC knob, not one vendor-determined setting. And as history has shown, it needs to be done through FOSS code and not trade-secret firmware.


The FEC is implemented in the hard drive firmware. You'll either read back a fully correct block or the whole read will fail. You're not going to read back a block with a single bit error, so it is insanity to protect against that at the FS level.

Also read up on RAID levels: RAID 5/6 and ZFS RAID-Z use parity-based error correction, not duplication.
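
Roughly the idea behind parity, as a toy Python sketch (pure illustration; real arrays do nothing this naive):

    # Toy single-parity (RAID 5 style) illustration: one XOR parity block instead
    # of full copies; any one lost block can be rebuilt from survivors + parity.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]           # three data blocks
    parity = xor_blocks(data)                    # a single parity block

    lost = 1                                     # pretend block 1 is unreadable
    survivors = [blk for i, blk in enumerate(data) if i != lost]
    assert xor_blocks(survivors + [parity]) == data[lost]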


What about over the net? One of the listed strengths is its streaming capabilities, which implies sending it over the net. Error correction would be important there as well.


So you need something other than IP checksums, and whatever error correction is also operating in the layers below it?


I've downloaded and uploaded many files which have arrived incorrectly over the internet, so yes, if you truly want to prevent bitrot, you do want to add error correction.


This is a transport problem, however. Why not calculate the error correction codes on the fly? Why waste disk blocks?


Because you want to guarantee a certain error margin regardless of the error detection and retransmit/correct capabilities of whatever networks, file systems, and media it traverses between now and whenever you want to read it.


Is there any FS that implements FEC?


Not that I'm aware of. The device layer usually does it already and it's supposed to be sufficient.


> it's supposed to be sufficient.

Supposed to. If you do any backups to DVD/BD and still want to get your data back in 5 or 10 years, though, you'd be well advised to do some sort of FEC - burn multiple copies of each disc, generate a bunch of PAR2, whatever.

(You might want to do that for backups on hard drives too. Yeah, maybe the hard drive firmware is supposedly taking care of any errors below the block level and you're not too worried about bitflips, but that just means you'll lose entire blocks and files when you lose something.)


If you count the CD and DVD spec books as file systems, yes.


BTRFS has bitrot protection.


Well, one more toy file format.


The README implies that `tar` is "not streamable". Someone needs a history lesson on what it was originally used for...


Perhaps they meant seekable when compressed? As in, the header for each object has the compressed length of the object? Presumably in a metadata header with name, compression format, claimed uncompressed length, date, etc.

Personally I'd like a tool that lets me extract files from TB-size archives easily without decompressing everything. Only virtual disk images seem to have that random-access functionality. There are ways of using gzip/bzip2/xz with sync points, so you could produce a compatible archive that lets you decompress just the metadata bits, though it would suck for many small files.
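
(For what it's worth, per-file compression plus an index is how zip pulls this off; a rough Python sketch, with made-up file names:)

    import zipfile

    # zip keeps a central directory plus independently compressed members, so one
    # file can be read without decompressing anything else in the archive.
    with zipfile.ZipFile("huge-archive.zip") as zf:          # hypothetical archive
        with zf.open("path/inside/archive.txt") as member:   # hypothetical member
            data = member.read()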


I think the README is just badly worded: streamability does differentiate it from RAR and ZIP, so it is a genuine differentiating factor there, just not against tar.


Tar is not streamable when compressed with gzip. 4Q supports compression on a per-file basis and thus can support streaming and compression at the same time. Zip also supports compression on a per-file basis, but it requires an index and thus is not streamable.


Funny, I've done plenty of gzip-compressed tar streams:

    # receiving end (start this first):
    nc -l 7000 | tar -xf -

    # sending end:
    tar -czf - * | nc 1.2.3.4 7000
Files appear one by one on the remote end.


You can stream zips if you don't use compression, and you can use compression on files small enough to hold in memory. I work at Barracuda Networks, and we actually do this every day.[1]

[1] https://github.com/barracudanetworks/ArchiveStream-php


Things it doesn't support: symlinks, POSIX ACLs, xattrs. The first one makes it a certain failure for archival use. The hard-coded link to an external crypto service (Keybase) makes it a failure for long-term use.


"The final archive format" is a very big promise that 4q doesn't keep right now. It falls short of 7z, RAR and tar.xz, and certainly isn't ready to replace them at the moment.

I'm not too familiar with CoffeeScript, but it doesn't seem like a good choice of language for writing an archiver. There's no actual draft file-format spec that I can see, either? But from a first pass, I have the following comments:

Crypto: Encrypted blocks are AES-256-CBC with a random IV and no MAC (!!!). You need to look at that again: that could be a problem. Hashed blocks are SHA-512, which is maybe OK (how is the length encoded? Watch out for length-extension attacks). That crypto is 14 years old and missing a vital piece: not "modern". Modern choices would include ChaCha20-Poly1305 (faster, more secure, seekable if you do it right), hashes like BLAKE2 (as the new RAR already uses), and signing things with Ed25519. Look into that kind of thing; you need a crypto overhaul. The keybase.io integration is a nice thought for UX, but is an online service in invite-only beta really ready to be baked into an archive format?
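
To make the AEAD suggestion concrete, a rough sketch of the construction (using the third-party Python 'cryptography' package; purely illustrative, not 4Q's actual code):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    key = ChaCha20Poly1305.generate_key()        # 256-bit key
    nonce = os.urandom(12)                       # 96-bit nonce; never reuse with the same key
    aead = ChaCha20Poly1305(key)

    header = b"block header"                     # authenticated but not encrypted
    ciphertext = aead.encrypt(nonce, b"archive block", header)

    # decrypt() raises InvalidTag if either the ciphertext or the header was
    # tampered with - the integrity check that bare AES-CBC is missing.
    plaintext = aead.decrypt(nonce, ciphertext, header)
    assert plaintext == b"archive block"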

Packing: LZMA2 is pretty good; 7z and xz already use it. For a fast algorithm, Snappy is not as good as LZ4, I understand? Neither is the last word in compression. Text/HTML/source code packs much better with a PPM-type model like PPMd (7z has that too; RAR had it as well but removed it recently), though you need to weigh up the decompression memory usage. ZPAQ's context-model mixing can pack tighter still, but it's much more intensive, and while I like extensibility, I don't like that the ZPAQ archive format carries what is essentially executable bytecode.
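
(A crude stdlib-only way to eyeball the ratio/speed trade-off being discussed, with zlib standing in for the "fast" end and LZMA for the "tight" end; the actual numbers depend entirely on the data:)

    import lzma
    import zlib

    sample = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 2000

    fast = zlib.compress(sample, 1)           # cheap and quick
    tight = lzma.compress(sample, preset=9)   # slow, but much smaller on text-like data
    print(len(sample), len(fast), len(tight))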

Other missing features that other archivers have: Volume splitting? Erasure coding or some other FEC? Can you do deltas? (e.g. binary software updates)

You've got some pleasant UX ideas for a command-line archiver (compared to some other command-line archivers!), but sorry, I don't think you're ready for 1.0.


Somehow this makes me think of this xkcd comic https://xkcd.com/927/


I like the mouse over text: "Fortunately, the charging one has been solved now that we've all standardized on mini-USB. Or is it micro-USB? Shit."

Now we're going to transition to USB-C.


Unless we're going to transition to DockPort to tunnel USB over DisplayPort.


Fairly certain they dropped DockPort as USB Type-C covers all the functionality already.


I hope so. But when did that ever stop anyone?


Off-topic, but is xkcd's SSL cert broken? I just noticed a big angry red 'X' through the icon in Chrome.


Chrome started flagging SSL certificates where the chain contains SHA-1; IIRC xkcd was used as an example of where this occurs further up the chain than the site's actual cert.

Just checked: if you look at the RapidSSL CA cert, it uses SHA-1.


Why would you make an archive format depend on a third-party service (keybase)? That's just a terrible idea for data longevity.


I really hope that this does not go mainstream, or I will have to install yet another archiver tool... I really don't understand why people use 7-Zip, for example, when storage is cheaper than ever. Just use tar and get on with your life.


Maybe because it doesn't have the Unicode issues that zip has?


Put "cu=on" (without quotes) in the "Parameters:" box of 7-Zip's "Add to Archive" window, and all your Unicode issues are solved.


  "the final archive format"
https://xkcd.com/927/

;)


How is that any better than modern archive formats like 7z?


What are the advantages of this over tar?


It has a name that cleverly sounds like you are swearing.


The four things in the first paragraph of the link?


Tar already has the first two, and even POSIX xattrs (which this doesn't preserve). The third seems useless (seems being the key word here; some people might find it useful), and I'd rather just use a program that will encrypt the archive for me (i.e. have a .tar.xz.enc).

One advantage this could have over the above is random access: with the above scheme, you might have to linearly decrypt and decompress the entire archive up to the file you want.


Technically speaking, then, the tar utility cannot compress or encrypt per-file, but the tar format could be used for this, and since we're talking about formats, the tar format can accommodate the requirements. It's just that there's no tool doing it at the moment.

(A counterpoint: while each file could be compressed and encrypted, there's nothing in the tar format that explicitly says so, meaning each file would have to be probed to determine whether it was compressed or encrypted.)


> One advantage this could have over the above is random access: with the above scheme, you might have to linearly decrypt and decompress the entire archive up to the file you want.

4Q uses CBC and the crypto lib doesn't seem to support random access, unless you manually divide your file into separately encrypted streams.
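
Roughly what that would look like (a sketch, again with the third-party Python 'cryptography' package; the chunk size and nonce layout are made up, and this is not what 4Q does):

    import os
    import struct
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    CHUNK = 64 * 1024
    key = ChaCha20Poly1305.generate_key()
    aead = ChaCha20Poly1305(key)
    prefix = os.urandom(4)                       # per-file nonce prefix

    def nonce_for(index):
        # 4-byte prefix + 8-byte big-endian counter = the required 12-byte nonce
        return prefix + struct.pack(">Q", index)

    def encrypt_chunks(data):
        # each chunk is sealed on its own, so chunk N can be decrypted alone
        return [aead.encrypt(nonce_for(i), data[off:off + CHUNK], None)
                for i, off in enumerate(range(0, len(data), CHUNK))]

    def decrypt_chunk(chunks, index):
        return aead.decrypt(nonce_for(index), chunks[index], None)

    chunks = encrypt_chunks(os.urandom(200 * 1024))
    middle = decrypt_chunk(chunks, 2)            # random access: only chunk 2 is touched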


Well, the first two are blatantly wrong. Tar is suitable for streaming and preserves the listed set of attributes.



It's in CoffeeScript! I wasn't expecting that; I was expecting C, Go, or Rust.


If it's final I guess the TODO section should be empty ;-)


Click the language bar on GitHub, see CoffeeScript, close the tab.



