4Q: the final archive format (github.com/robey)
56 points by bpierre on April 16, 2015 | 46 comments



Final? Really? What if I need xattrs and POSIX ACLs (or I use Windows and want NT ACLs and streams; or forks under OS X; …)? Hard-coded encryption algorithms also don't seem particularly future-proof.


No forward error correction either. It can't be considered "final" without some form of bitrot protection.

And no, storing it on a ZFS or BTRFS volume with error recovery enabled does not count. (They don't use FEC, they use 1960's triplicate storage. Hugely wasteful of space, does nothing to protect against transmission errors and can still be corrupted by two of the exact wrong bits being damaged.)

Storing it on a medium that does use FEC also does not count. I want a per-file tunable FEC knob, not one vendor-determined setting. And as history has shown, it needs to be done through FOSS code and not trade-secret firmware.


The FEC is implemented in the hard drive firmware. You'll either read back a fully correct block or the whole read will fail. You're not going to read back a block with a single bit error, so it is insanity to protect against that at the FS level.

Also read up on RAID levels: RAID 5/6 and ZFS RAID-Z use parity-based error correction, not duplication.
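
Roughly the idea behind parity, as a toy Python sketch (pure illustration; real arrays do nothing this naive):

    # Toy single-parity (RAID 5 style) illustration: one XOR parity block instead
    # of full copies; any one lost block can be rebuilt from survivors + parity.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]           # three data blocks
    parity = xor_blocks(data)                    # a single parity block

    lost = 1                                     # pretend block 1 is unreadable
    survivors = [blk for i, blk in enumerate(data) if i != lost]
    assert xor_blocks(survivors + [parity]) == data[lost]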


What about over the net? One of the listed strengths is its streaming capabilities, which implies sending it over the net. Error correction would be important there as well.


So you need something other than IP checksums, and whatever error correction is also operating in the layers below it?


I've downloaded and uploaded many files which have arrived incorrectly over the internet, so yes, if you truly want to prevent bitrot, you do want to add error correction.


This is a transport problem, however. Why not calculate the error correction codes on the fly? Why waste disk blocks?


Because you want to guarantee a certain error margin regardless of the error detection and retransmit/correct capabilities of whatever networks, file systems, and media it traverses between now and whenever you want to read it.


Is there any FS that implements FEC?


Not that I'm aware of. The device layer usually does it already and it's supposed to be sufficient.


> it's supposed to be sufficient.

Supposed to. If you do any backups to DVD/BD and still want to get your data back in 5 or 10 years, though, you'd be well advised to do some sort of FEC - burn multiple copies of each disc, generate a bunch of PAR2, whatever.

(You might want to do that for backups on hard drives too. Yeah, maybe the hard drive firmware is supposedly taking care of any errors below the block level and you're not too worried about bitflips, but that just means you'll lose entire blocks and files when you lose something.)


If you count the CD and DVD spec books as file systems, yes.


BTRFS has bitrot protection.


Well, one more toy file format.


The README implies that `tar` is "not streamable". Someone needs a history lesson on what it was originally used for...


Perhaps they meant seekable when compressed? As in, the header for each object has the compressed length of the object? Presumably in a metadata header with name, compression format, claimed uncompressed length, date, etc.

Personally I'd like a tool that lets me extract files from TB-size archives easily without decompressing everything. Only virtual disk images seem to have that random-access functionality. There are ways of using gzip/bzip2/xz with sync points, so you could produce a compatible archive that lets you decompress just the metadata bits, though it would suck for many small files.
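
(For what it's worth, per-file compression plus an index is how zip pulls this off; a rough Python sketch, with made-up file names:)

    import zipfile

    # zip keeps a central directory plus independently compressed members, so one
    # file can be read without decompressing anything else in the archive.
    with zipfile.ZipFile("huge-archive.zip") as zf:          # hypothetical archive
        with zf.open("path/inside/archive.txt") as member:   # hypothetical member
            data = member.read()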


I think the README is just badly worded: streamability does differentiate it from RAR and ZIP, so it is a genuine differentiating factor there, just not against tar.


Tar is not streamable when compressed with gzip. 4Q supports compression on a per-file basis and thus can support streaming and compression at the same time. Zip also supports compression on a per-file basis, but it requires an index and thus is not streamable.


Funny, I've done plenty of gzip-compressed tar streams:

    # receiving end (start this first):
    nc -l 7000 | tar -xf -

    # sending end:
    tar -czf - * | nc 1.2.3.4 7000
Files appear one by one on the remote end.


You can stream zips if you don't use compression, and you can use compression on files small enough to hold in memory. I work at Barracuda Networks, and we actually do this every day.[1]

[1] https://github.com/barracudanetworks/ArchiveStream-php


Things it doesn't support: symlinks, POSIX ACLs, xattrs. The first one makes it a certain failure for archival use. The hard-coded link to an external crypto service (Keybase) makes it a failure for long-term use.


"The final archive format" is a very big promise that 4q doesn't keep right now. It falls short of 7z, RAR and tar.xz, and certainly isn't ready to replace them at the moment.

I'm not too familiar with CoffeeScript, but it doesn't seem like a good choice of language for writing an archiver. There's no actual draft file-format spec that I can see, either? But from a first pass, I have the following comments:

Crypto: Encrypted blocks are AES-256-CBC with a random IV and no MAC (!!!). You need to look at that again: that could be a problem. Hashed blocks are SHA-512, which is maybe OK (how is the length encoded? Watch out for length-extension attacks). That crypto is 14 years old and missing a vital piece: not "modern". Modern choices would include ChaCha20-Poly1305 (faster, more secure, seekable if you do it right), hashes like BLAKE2 (as the new RAR already uses), and signing things with Ed25519. Look into that kind of thing; you need a crypto overhaul. The keybase.io integration is a nice thought for UX, but is an online service in invite-only beta really ready to be baked into an archive format?
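
To make the AEAD suggestion concrete, a rough sketch of the construction (using the third-party Python 'cryptography' package; purely illustrative, not 4Q's actual code):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    key = ChaCha20Poly1305.generate_key()        # 256-bit key
    nonce = os.urandom(12)                       # 96-bit nonce; never reuse with the same key
    aead = ChaCha20Poly1305(key)

    header = b"block header"                     # authenticated but not encrypted
    ciphertext = aead.encrypt(nonce, b"archive block", header)

    # decrypt() raises InvalidTag if either the ciphertext or the header was
    # tampered with - the integrity check that bare AES-CBC is missing.
    plaintext = aead.decrypt(nonce, ciphertext, header)
    assert plaintext == b"archive block"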

Packing: LZMA2 is pretty good; 7z and xz already use it. For a fast algorithm, Snappy is not as good as LZ4, I understand? Neither is the last word in compression. Text/HTML/source code packs much better with a PPM-type model like PPMd (7z has that too; RAR had it as well but removed it recently), though you need to weigh up the decompression memory usage. ZPAQ's context-model mixing can pack tighter still, but it's much more intensive, and while I like extensibility, I don't like that the ZPAQ archive format carries what is essentially executable bytecode.
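
(A crude stdlib-only way to eyeball the ratio/speed trade-off being discussed, with zlib standing in for the "fast" end and LZMA for the "tight" end; the actual numbers depend entirely on the data:)

    import lzma
    import zlib

    sample = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 2000

    fast = zlib.compress(sample, 1)           # cheap and quick
    tight = lzma.compress(sample, preset=9)   # slow, but much smaller on text-like data
    print(len(sample), len(fast), len(tight))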

Other missing features that other archivers have: Volume splitting? Erasure coding or some other FEC? Can you do deltas? (e.g. binary software updates)

You've got some pleasant UX ideas for a command-line archiver (compared to some other command-line archivers!), but sorry, I don't think you're ready for 1.0.


Somehow this makes me think of this xkcd comic https://xkcd.com/927/


I like the mouse over text: "Fortunately, the charging one has been solved now that we've all standardized on mini-USB. Or is it micro-USB? Shit."

Now we're going to transition to USB-C.


Unless we're going to transition to DockPort to tunnel USB over DisplayPort.


Fairly certain they dropped DockPort as USB Type-C covers all the functionality already.


I hope so. But when did that ever stop anyone?


Off-topic, but is xkcd's SSL cert broken? I just noticed a big angry red 'X' through the icon in Chrome.


Chrome started flagging SSL certificates where the chain contains SHA-1; IIRC xkcd was used as an example of where this occurs further up the chain than the site's actual cert.

Just checked: if you look at the RapidSSL CA cert, it uses SHA-1.


Why would you make an archive format depend on a third-party service (keybase)? That's just a terrible idea for data longevity.


I really hope that this does not go mainstream, or I will have to install yet another archiver tool... I really don't understand why people use 7-Zip, for example, when storage is cheaper than ever. Just use tar and get on with your life.


Maybe because it doesn't have the Unicode issues that zip has?


Put "cu=on" (without quotes) in the "Parameters:" box of 7-Zip's "Add to Archive" window, and all your Unicode issues are solved.


  "the final archive format"
https://xkcd.com/927/

;)


How is that any better than modern archive formats like 7z?


What are the advantages of this over tar?


It has a name that cleverly sounds like you are swearing.


The four things in the first paragraph of the link?


Tar already has the first two, and even POSIX xattrs (which this doesn't preserve). The third seems useless (seems being the key word here; some people might find it useful), and I'd rather just use a program that will encrypt the archive for me (i.e. have a .tar.xz.enc).

One advantage this could have over the above is random access: with the above scheme, you might have to linearly decrypt and decompress the entire archive up to the file you want.


Technically speaking, then, the tar utility cannot compress or encrypt per-file, but the tar format could be used for this, and since we're talking about formats, the tar format can accommodate the requirements. It's just that there's no tool doing it at the moment.

(A counterpoint: while each file could be compressed and encrypted, there's nothing in the tar format that explicitly says so, meaning each file would have to be probed to determine whether it was compressed or encrypted.)


> One advantage this could have over the above is random access: with the above scheme, you might have to linearly decrypt and decompress the entire archive up to the file you want.

4Q uses CBC and the crypto lib doesn't seem to support random access, unless you manually divide your file into separately encrypted streams.
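
Roughly what that would look like (a sketch, again with the third-party Python 'cryptography' package; the chunk size and nonce layout are made up, and this is not what 4Q does):

    import os
    import struct
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    CHUNK = 64 * 1024
    key = ChaCha20Poly1305.generate_key()
    aead = ChaCha20Poly1305(key)
    prefix = os.urandom(4)                       # per-file nonce prefix

    def nonce_for(index):
        # 4-byte prefix + 8-byte big-endian counter = the required 12-byte nonce
        return prefix + struct.pack(">Q", index)

    def encrypt_chunks(data):
        # each chunk is sealed on its own, so chunk N can be decrypted alone
        return [aead.encrypt(nonce_for(i), data[off:off + CHUNK], None)
                for i, off in enumerate(range(0, len(data), CHUNK))]

    def decrypt_chunk(chunks, index):
        return aead.decrypt(nonce_for(index), chunks[index], None)

    chunks = encrypt_chunks(os.urandom(200 * 1024))
    middle = decrypt_chunk(chunks, 2)            # random access: only chunk 2 is touched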


Well, the first two are blatantly wrong. Tar is suitable for streaming and preserves the listed set of attributes.



It's in CoffeeScript! I wasn't expecting that; I was expecting C, Go, or Rust.


If it's final I guess the TODO section should be empty ;-)


Click the language bar on GitHub, see CoffeeScript, close the tab.



