
40 + 15 days to compress? How long would it take to decompress this thing?



7zip, like many other compression schemes, is designed so that decompression is typically (much) faster than compression.

The web page (http://www.7-zip.org/7z.html) states that the default "native" LZMA format decompresses at between 10 and 20 times the speed at which it compresses.

So, 15 days / 15 is about one day to decompress, then.


LZMA is well known for its decompression speed, which is one of the reasons it's a popular choice for filesystem compression. LZMA decompression can easily keep pace with a disk, so adding it at the filesystem layer gives a noticeable performance boost, especially for read-heavy workloads.
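
For what it's worth, Python's standard library ships an lzma module, so it's easy to see the streaming-decompression behaviour for yourself. A minimal sketch (the filename is made up):

    import lzma

    # Stream-decompress an .xz file in chunks instead of loading it all
    # into memory; "dump.xml.xz" is a hypothetical filename.
    with lzma.open("dump.xml.xz", "rb") as f:
        while True:
            chunk = f.read(1 << 20)  # roughly 1 MiB at a time
            if not chunk:
                break
            # ... hand the chunk to whatever consumes the data ...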

Gzip usually gets a slightly better compression ratio, but at the expense of decompression speed, particularly on less compressible data (LZMA somehow seems to know better when to give up trying to compress). Bzip2 has the best compression ratio of the three, but is far slower at both compressing and decompressing, so you end up losing more time decompressing than you gain by doing less actual I/O.
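
If you want to sanity-check the ratios and speeds on your own data, all three codecs are in the Python standard library. A quick-and-dirty sketch (hypothetical filename, default compression levels):

    import bz2, gzip, lzma, time

    # Compare ratio and speed of the three codecs on the same input;
    # "sample.xml" is a stand-in for whatever file you care about.
    with open("sample.xml", "rb") as f:
        data = f.read()

    for name, mod in (("gzip", gzip), ("bzip2", bz2), ("lzma", lzma)):
        t0 = time.time()
        packed = mod.compress(data)
        t1 = time.time()
        mod.decompress(packed)
        t2 = time.time()
        print(f"{name}: ratio {len(data) / len(packed):.2f}, "
              f"compress {t1 - t0:.1f}s, decompress {t2 - t1:.1f}s")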

EDIT: source, for those curious folk out there: http://portal.acm.org/citation.cfm?id=1534536 (caveat: the experiments were run by taking a large file on disk and compressing it to another large file on disk, so seek thrashing may have been an issue and I wouldn't take the numbers entirely at face value)


I'm pretty sure LZMA (7z) compresses better than bz2.

At least it does when I test it. But it's slower - sometimes much slower (depends on settings).


That might be true, though I don't know whether that depends on the entropy in the file. It is likely the case that one of them compresses text better but takes a larger performance hit with binary data or somesuch.

Turns out the paper I was recalling and referencing dealt with LZO, not LZMA, so maybe I have less to say about LZMA than I thought. Shows how much you jerks read before upvoting. ;-)


They also parallelize the compression (at least with bzip2) and run more than one dump process at the same time.
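
That's roughly what tools like pbzip2 do: a multi-stream .bz2 is still a valid bzip2 file, so you can compress independent chunks in parallel and concatenate the results. A sketch in Python (the filename and chunk size are made up):

    import bz2
    from multiprocessing import Pool

    CHUNK = 32 * 1024 * 1024  # arbitrary chunk size

    def chunks(path):
        # Yield fixed-size chunks of the input file.
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    return
                yield block

    if __name__ == "__main__":
        # Compress each chunk in its own process and concatenate the
        # resulting bzip2 streams into one multi-stream .bz2 file.
        with Pool() as pool, open("dump.xml.bz2", "wb") as out:
            for packed in pool.imap(bz2.compress, chunks("dump.xml")):
                out.write(packed)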


I imagine that the compression is only for saving time during transfer.

That, and hopefully you could decompress just what you needed, e.g. a tarball of compressed articles.
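
Something along those lines would make partial decompression cheap; with Python's tarfile you can pull a single member out of a compressed tarball without unpacking the rest (the archive layout here is invented):

    import tarfile

    # Extract one member from a bzip2-compressed tarball; the archive
    # and member names are hypothetical.
    with tarfile.open("articles.tar.bz2", "r:bz2") as tar:
        member = tar.extractfile("articles/Some_Article.xml")
        if member is not None:
            text = member.read()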


A lot of researchers just stream the decompressed stream directly from 7z into their analysis scripts. If you were to actually decompress to a giant XML file first, you'd both: 1. need a 6-TB drive; and 2. start getting disk I/O as a big bottleneck. In a lot of cases, the analysis scripts, rather than decompression, are the bottleneck anyway: 7z can feed you data faster than your XML parser and scripts can consume it.
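
A rough sketch of that streaming setup, using 7z's -so switch to send the decompressed data to stdout and an incremental XML parser on the other end (the dump filename and the <page> element are assumptions):

    import subprocess
    import xml.etree.ElementTree as ET

    # Let 7z decompress to stdout and parse the XML incrementally, so the
    # full multi-terabyte file never has to exist on disk.
    proc = subprocess.Popen(
        ["7z", "x", "-so", "dump.xml.7z"],
        stdout=subprocess.PIPE,
    )

    for event, elem in ET.iterparse(proc.stdout, events=("end",)):
        if elem.tag.endswith("page"):
            # ... run the analysis on this <page> element ...
            elem.clear()  # keep memory flat while streaming

    proc.stdout.close()
    proc.wait()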




