
The main purpose of this "pixz" appears to be its chunking of the compressed data so that it is partially decompressible (i.e. random access). "xz" already has -T/--threads= for multithreaded processing (although pixz does seem to have a different default: all cores instead of one thread).



The --index option that I added to pigz (see another comment) allows random access, so that decompression can also be multi-threaded.
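A minimal sketch of what indexed, multi-threaded decompression looks like (Python; the index format here -- a list of (compressed offset, length) pairs pointing at independently decodable deflate blocks -- is my assumption for illustration, not the actual patch format):

  import zlib
  from concurrent.futures import ProcessPoolExecutor

  def decompress_block(args):
      # each indexed block is assumed to be raw deflate data that can be
      # decoded without any back-references into earlier blocks
      path, offset, length = args
      with open(path, "rb") as f:
          f.seek(offset)
          raw = f.read(length)
      return zlib.decompressobj(wbits=-15).decompress(raw)

  def parallel_decompress(path, index):
      # index: [(compressed_offset, compressed_length), ...] -- hypothetical
      with ProcessPoolExecutor() as pool:
          args = [(path, o, n) for o, n in index]
          return b"".join(pool.map(decompress_block, args))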

The workload I was interested in (compressing virtual machine memory during suspend) tends to be I/O bound. pigz struck a good balance, keeping both CPUs and disks busy on an important subset of the machines of interest. That balanced saturation seemed to be optimal for minimizing suspend time. If the suspend image was on fast storage (that is, storage that could saturate a 10-gig link), multi-threaded decompression made a big difference to resume time.

I tried pixz and, I think, pbzip2, and found that while the compression ratio was better than what pigz offers, the increase in CPU-bound run time during compression was unacceptable.

The scales may tip in favor of pixz when compression speed matters less than, e.g., minimizing the bandwidth required for thousands of downloads of the compressed content.


xz in multithreaded mode supports random access too, at least theoretically. But there's no reasonable way with xz to actually find the file you want inside a tarball; that's the bit pixz provides, by storing an index of the tar members.

Another nice thing about pixz is that it does parallel decompression as well as compression.

(Disclaimer: I'm the original author of pixz.)


I was thinking about that "no reasonable way" comment. When you decompress the first block, you find the first tar header. From that you know the uncompressed offset of the next tar header. If the compressed stream supports random access (and the uncompressed block size is a multiple of 512 bytes), you can decompress just the block containing that offset to reach the next header. You can repeat this until you get to the file you are looking for.
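A minimal Python sketch of that walk (read_at is a hypothetical helper that decompresses just the block(s) covering a given range of the uncompressed stream):

  BLOCK = 512

  def parse_header(buf):
      # ustar header: name occupies bytes 0-99, size is octal ASCII at 124-135
      name = buf[0:100].rstrip(b"\0").decode()
      size = int(buf[124:136].rstrip(b" \0") or b"0", 8)
      return name, size

  def find_member(read_at, target):
      offset = 0
      while True:
          header = read_at(offset, BLOCK)
          if header.count(b"\0") == BLOCK:  # all-zero block: end of archive
              return None
          name, size = parse_header(header)
          if name == target:
              return read_at(offset + BLOCK, size)
          # file data is padded out to a whole number of 512-byte blocks
          offset += BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK

The point is that read_at only ever has to decompress the blocks that contain headers, plus the target file's data; everything in between is skipped.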

With large files, this approach would be of huge value. If the files tend to be no larger than block_size - 512, though, there will be no speedup: every block then contains a header that has to be read anyway, so no block can be skipped.

Of course, this would need to be implemented directly in tar, not by piping the output of a decompression command through tar.



