My guess is that this compresses less efficiently as you would have to shard the dictionaries. Might be close though for large files. I was surprised that there were no speed or efficiency comparisons in the README.
The max window size for zlib is 32 KB, so I don't think the default sharding at 128 KB would change much. You can pass the -b parameter if you find that a larger block size works better on your data.
If you are looking for details of the design of pigz, there is a very well-documented overview in the source of pigz.c:
I tested it on a 680 MB text file. gzip compresses it to 246.0 MB; pigz compresses it to 245.5 MB. I see a similar percent change on a 3.8 MB text file. So they are approximately equivalent.
First, thanks for the numbers; it's useful to see real-world examples.
Second, and this isn't meant to be a critique (I'm just trying to understand a phenomenon I keep seeing): is there a reason you prefer presenting it as a percentage decrease? Every time I read "X% decrease" I feel obliged to go back to the source numbers, because people mess that terminology up so often that I'm never sure whether it's being used correctly (you are). For myself, I generally write "X ran in Y% of the time Z took," specifically because I don't want people to misinterpret it. Is the "X% decrease" presentation preferred, taught, or considered standard? Am I alone in feeling it's more likely to be misinterpreted?
(Sorry your comment is the one I brought this up on, I've just been wondering this for a while.)
In my experience, it is more typical to use percent change or relative change in the physical sciences, and this is how I was taught. Just to be clear: if you have values t1 and t2, the relative change is (t1 - t2)/t1. There is a one-to-one correspondence with what you described, which is t2/t1.
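For a concrete example, here is the same comparison stated both ways, using the gzip/pigz output sizes reported upthread (just arithmetic, sketched in Python):

    # relative change vs. ratio, using the output sizes reported above (in MB)
    t1, t2 = 246.0, 245.5              # gzip output, pigz output
    relative_change = (t1 - t2) / t1   # ~0.002, i.e. about a 0.2% decrease
    ratio = t2 / t1                    # ~0.998, i.e. pigz's output is ~99.8% of gzip's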
I think teej explained it well. If I say "the new value is +20% or -20%", it is immediately obvious those have the same magnitude and opposite direction. But, for some people, when I say "the new value is 120% or 80% of the old value", it is not immediately obvious that they have the same magnitude. It requires a small extra step for the reader to realize that this means the same amount of relative change.
I always state it as a relative change. I find that people can get really confused if you say "X ran in 110% of the time of Y," even though it is stated in a clear way.
My preferred way of communicating this concept is "we observed a +10% change in X compared to Y." I always use a +/- sign and this helps signal that I'm talking about a relative change.
If I am comparing percents, I'll always specify "relative" or "absolute" change, though I prefer to use relative change. Occasionally, if the change is small, I will use basis points instead of percents to communicate the absolute change.
The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread.
I'll check it out. However, if you have to wait for the previous block to compress the following block, I don't see how you can parallelize it completely. My assumption is that you would have to shard the file at a higher level and still compress those shards independently. That should get close to the same results as a sequential compressor, but for small file sizes both this effect and Amdahl's law would start to rear their heads. I suppose you could get around that by automatically not parallelizing compression below some minimum file size.
> However, if you have to wait for the previous block to compress the following block, I don't see how you can parallelize it completely.
That is not what the man page says, though. The block size for any given session is fixed, so you know the boundaries of each block prior to compression.
Each block gets a copy of the last 32 kbytes of data from the end of the prior block.
The algorithm used by gzip compresses by finding repeated strings within the last 32 kbytes of uncompressed data it has seen, so there is no dependency on the prior block's compressed output; all the current block needs is a copy of the last 32k of the prior block's uncompressed data, which is known before compression starts.
There's no "generating" of a dictionary. The gzip algorithm is based upon the "dictionary" being the last seen 32k bytes of uncompressed data based upon where in the file the compressor is presently working (technically in compression circles it is a 'windowed' algorithm, not a 'dictionary' algorithm). It compresses by finding a repetitive string in that 32k window that matches a string at the current ___location, and outputting an instruction that says (in effect): "seek backwards in the uncompressed data 200 bytes, then copy 100 bytes from there to here".
So as long as each parallel block has the final 32k of the prior block prepended, the compression effectiveness will be essentially identical between a straight sequential compression and a pigz parallel compression, because at byte 0 of the current block the 32k window of uncompressed prior data is available to match against, just as if the compressor were running sequentially.
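Here is a minimal sketch of that idea using Python's zlib bindings (this is not how pigz itself assembles its output; pigz produces a single continuous gzip stream, and the block splitting here is purely illustrative): each block is deflated independently, but gets the last 32 KB of the preceding input as a preset dictionary, so back-references can still reach across the block boundary.

    import zlib

    BLOCK = 128 * 1024   # pigz's default block size
    WINDOW = 32 * 1024   # deflate's maximum back-reference window

    def compress_blocks(data):
        blocks = []
        for i in range(0, len(data), BLOCK):
            tail = data[max(0, i - WINDOW):i]   # last 32 KB preceding this block
            if tail:
                co = zlib.compressobj(level=9, wbits=-15, zdict=tail)
            else:
                co = zlib.compressobj(level=9, wbits=-15)   # first block: no dictionary
            blocks.append(co.compress(data[i:i + BLOCK]) + co.flush())
        return blocks

    def decompress_blocks(blocks):
        data = b""
        for raw in blocks:
            tail = data[-WINDOW:]               # same 32 KB the compressor used
            if tail:
                do = zlib.decompressobj(wbits=-15, zdict=tail)
            else:
                do = zlib.decompressobj(wbits=-15)
            data += do.decompress(raw) + do.flush()
        return data

    payload = b"the quick brown fox jumps over the lazy dog " * 20000
    assert decompress_blocks(compress_blocks(payload)) == payload

Each block comes out as its own raw deflate stream here; the point is just that the preset dictionary lets block N+1 match against the tail of block N without waiting for block N to finish compressing.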
The only growth from pigz comes from needing to round each parallel compressed block up to a multiple of 8 bits (the Huffman codes that are output are bitstrings that don't align with 8-bit byte boundaries). But worst case that is 7 bits per parallel block. Given the performance gains on multi-CPU systems, a net increase of a few hundred bytes to a few kilobytes is not likely to matter. If those bytes did matter, then one should use bzip2 or lzip or xz and get much higher compression ratios (at the expense of much longer run times).
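As a rough upper bound for the 680 MB example earlier in the thread (assuming the default 128 KB block size mentioned above):

    # worst-case byte-alignment padding for a 680 MB input at 128 KB blocks
    blocks = (680 * 1024) // 128     # 5,440 blocks
    padding = blocks * 7 / 8         # 4,760 bytes, i.e. under 5 KB in the worst case
    # next to a ~245 MB compressed output, that is a rounding error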