Smaller and faster data compression with Zstandard (facebook.com)
819 points by jamesgpearce on Aug 31, 2016 | 156 comments



I have been waiting for this to hit 1.0 and, more importantly, get popular so that I can use it everywhere. I am really a fan of Yann Collet's work. This is extremely impressive work, especially when you consider that lz4 seems to be better than Snappy (from Google) and Zstandard better than LZFSE (from Apple). I think he is the first one to write a practical fast arithmetic coder using ANS. And look at how his Huffman implementation blazes past zlib's Huffman, though it compresses less than FSE [0]. I also like reading his blog posts. While a lot of them go over my head, I can generally make sense of what he is trying and why something is working despite the complexity.

[0] https://github.com/Cyan4973/FiniteStateEntropy


Cyan4973/Yann also is well known for xxhash[1], which is one of the faster hashers[2] out there. Post a new hasher and people will probably ask about xxhash (ex: metrohash[3]). Guy is an absolute machine. If you search Google for 'zstd' right now, you'll find him, not Facebook, namely: https://github.com/Cyan4973/zstd . Glad his work is being supported by someone now! Immensely well deserved, after so many years of helping make everyone fast.

PS - I didn't know a lot of the terms you used, but the Finite State Entropy (FSE) link you provided does a good job intro'ing some of them, and the linked paper Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding [4] (ANS) seems interesting.

[1] https://github.com/Cyan4973/xxHash

[2] https://github.com/rurban/smhasher

[3] https://news.ycombinator.com/item?id=9613098

[4] http://arxiv.org/abs/1311.2540


I found his (Yann's) blog posts the most helpful for getting a basic idea of what was going on with FSE: https://fastcompression.blogspot.com/2013/12/finite-state-en... and the followup https://fastcompression.blogspot.fr/2014/01/fse-decoding-how...


Yes. I forgot about xxHash :). I think it was created as part of checksumming for LZ4 (not entirely sure). That is another amazing part of his work. He is producing these things (xxHash, FSE, Huff0) that are state-of-the-art projects in their own right, alongside his state-of-the-art compression algorithms, making sure others can benefit the most from his work without reinventing it. At the same time he is blogging his thought process and experiments in a manner that even a layman like me gets the gist of for the most part. Now whether it's useful enough for an expert in the field I can't say.

There are a lot of other gems in his blog. He points to resources and other people he has worked with, making it much easier to get a bigger picture of the related topics.


> I think he is the first one to write a practical fast arithmetic coder using ANS.

I don't think he is the first; although RAD Game Tools has been cagey about the details, many strongly suspect their recently announced Kraken, etc. use ANS in some form, as do some of their previous products; see the discussion here:

https://news.ycombinator.com/item?id=11583898

...and here:

http://encode.ru/threads/2492-Kraken-compressor

http://cbloomrants.blogspot.ca/2016_05_01_archive.html


Yeah, I read about Kraken a while ago and found the encode.ru thread you pointed to. It's unfortunate that we can't compare them side by side. Those guys seem to be another set of compression geniuses. Overall, the world of compression is moving very fast these days.

But even with that, Yann might be the first. Yann's work on FSE is old [0], and if I'm reading this correctly it is a mix of his own ideas and Jarek Duda's. It also seems one of his failed attempts challenged Jarek to invent another variant of ANS called rANS [1], and as you can see in the encode.ru comments from the RAD Game Tools folks, they seem to be using rANS, at least for Kraken. Regardless, these are impressive works, and these people are bouncing ideas off each other, being challenged and inspired, which is a very good thing for the rest of us :).

[0] https://fastcompression.blogspot.com/2013/12/finite-state-en...

[1] http://encode.ru/threads/1821-Asymetric-Numeral-System?p=360...

[2] https://github.com/rygorous/ryg_rans


Before Yann, this ANS variant was implemented by Andrew Polar in 2008: http://www.ezcodesample.com/abs/abs_article.html Here is a list of implementations: http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Sys...


Yeah, it's all very hazy, and since RAD is (understandably) proprietary, it's difficult to figure out exactly when who did what.


There are now actually open source decompressors[0] for Oodle formats, and yes, many use rANS. Especially LZNA's clever use of SSE2 to decode and update 16 symbol models is interesting. FSE (which is tANS) is a different beast, though, and more suitable for rarely updated models.

[0] http://encode.ru/threads/2577-Open-source-Kraken-Mermaid-Sel...


...and got DMCAed [1]. At the very least the discussion itself is interesting for the inside info, though.

[1] https://github.com/github/dmca/blob/master/2016/2016-08-30-O...


That was because of the belief that he had used an illegal copy of the DLL and that he had copied all implementation details (neither is true). A representative from RAD said they would withdraw the DMCA complaint now that the misunderstanding has been cleared up.


Now here things get really interesting [0]. The claim is that a higher level of compression has negligible impact on decode speed, which makes Kraken really impressive for compress-once, decompress-many situations.

[0] http://cbloomrants.blogspot.ca/2016/07/introducing-oodle-mer...


RAD use rANS for their compressors. See ryg's blog [0] for more details, and a bunch more interesting posts about compression and various low-level optimizations.

[0] https://fgiesen.wordpress.com/2015/12/21/rans-in-practice/


I agree, I just started using Zstd for uniprot.org (dev branch) and it looks like it is a lot faster in decoding than deflate or zlib and even smaller on disk. It is one of those things where the improvement really matters for us and gives us faster results for our users.

i.e. download speed is up to 20% faster with Zstd as compression algo for backing store compared to Deflate. Assuming bandwidth is available ;)


There is just so much awesome stuff in this article. Finite State Entropy and Asymmetric Numeral System are completely new concepts to me (I've got 7 open tabs just from references FB supplied in the article), as is repcode modeling. I love that they've already built in granular control over the compression tradeoffs you can make, and I can't wait to look into Huff0. If anyone outside of Facebook has started playing with it or is planning to put it into production right away I'd love to hear about it.


Indeed ANS is difficult: it is the biggest innovation in compression in the last 20 years. Its author has some nice but dense slides about it.

https://dl.dropboxusercontent.com/u/12405967/ANSsem.pdf

Not sure exactly when repcodes were invented. Igor Pavlov already used them in 7-Zip.


The best TL;DR of ANS is something like this (without being too wrong). It's still too long:

Huffman requires at least one bit to represent any symbol, because it finds unique prefix codes for every symbol by varying the leading bits.

Arithmetic coding encodes symbols as fractional numbers of bits, by using binary fractions. It divides up the range to make this work. In the end, you get one fraction per "message" that can be decoded back into the message (i.e., a fraction like 0.53817213781271231...).

Range coding is similar; it just uses integers instead of floating point. You get one number that can be decoded back into the message (i.e., a number like 12312381219129123123).

Note that in both of these, you have not changed the number system at all. The even and odd numbers still have the same density.

Another way to look at the above is based on what a bit selects.

In Huffman, a single bit is generally not enough to select something; the prefixes are long. So you need to walk a bunch of bits to figure out what you've got.

In arithmetic or range coding, a single bit selects a very large range. They change the proportions of those ranges, but at some point, something has to describe that range. This is because it's a range. Knowing you have gotten to the 8 in 0.538 doesn't tell you anything on its own; you need to know "the subrange for symbol a is 0.530 ... 0.539", so it's a. So you have to transmit that range.

ANS is a different trick. Instead of encoding things using the existing number system, it changes the number system. That is, it redefines the number system so that our even and odd numbers are still uniformly distributed but have different densities. If you do this in the right way, you end up with a number system that lets you only require one number to determine the state, instead of two (like in a range).
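
To make that concrete, here is a toy sketch (mine, not from zstd) of the core rANS state transition in C. Real coders like FSE use the table-based tANS variant and add renormalization and bit I/O; the frequencies and the message below are made up purely for illustration:

    /* Toy rANS: the bare state transition, no renormalization or bit I/O.
       Symbol s has frequency freq[s] out of M = sum(freq); cdf[s] is the
       cumulative frequency below s.  Encoding grows the state x; decoding
       runs in reverse and shrinks it back, recovering the symbols LIFO. */
    #include <stdint.h>
    #include <stdio.h>

    enum { NSYM = 3 };
    static const uint32_t freq[NSYM] = { 3, 3, 2 };  /* P(a)=3/8, P(b)=3/8, P(c)=2/8 */
    static const uint32_t cdf[NSYM]  = { 0, 3, 6 };
    static const uint32_t M = 8;

    static uint64_t encode(uint64_t x, int s) {
        return (x / freq[s]) * M + cdf[s] + (x % freq[s]);
    }

    static uint64_t decode(uint64_t x, int *s) {
        uint32_t r = (uint32_t)(x % M);
        int sym = 0;
        while (sym + 1 < NSYM && cdf[sym + 1] <= r) sym++;
        *s = sym;
        return freq[sym] * (x / M) + r - cdf[sym];
    }

    int main(void) {
        const int msg[] = { 0, 1, 2, 1, 0, 0 };   /* hypothetical message */
        const int n = 6;
        uint64_t x = 1;                           /* initial state */
        for (int i = 0; i < n; i++) x = encode(x, msg[i]);
        printf("final state: %llu\n", (unsigned long long)x);
        for (int i = n - 1; i >= 0; i--) {        /* decode pops symbols in reverse */
            int s;
            x = decode(x, &s);
            printf("recovered msg[%d] = %d\n", i, s);
        }
        return 0;
    }

Note that decoding pops symbols in the reverse order of encoding, which is why production encoders usually process the input backwards so the decoder can stream forwards.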


With regular arithmetic coding and Huffman coding you don't need to send over a dictionary. Instead, you can have an adaptive model that learns to compress the data as it goes (e.g. keeps running track of symbol frequencies), and it will still be reversible.

I thought this wasn't possible with ANS. Or has this changed?


It depends on whether you have a static or an adaptive coder.

A static coder is much cheaper: it uses the same probabilities for the entire data block (e.g. 30 kB), and the probabilities are stored in the header. Practically all Huffman and tANS compressors work this way (there are exceptions, however: https://en.wikipedia.org/wiki/Adaptive_Huffman_coding ).

An adaptive coder can start with, e.g., a uniform probability (no need to store it in the header) and learn along the way. It is more costly but gives better compression, and is typically used with arithmetic coding or rANS. See https://fgiesen.wordpress.com/2015/05/26/models-for-adaptive... https://fgiesen.wordpress.com/2015/12/21/rans-in-practice/
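
To make the distinction concrete, here is a minimal sketch (not from any particular codec) of the kind of order-0 adaptive model an adaptive coder drives. The entropy coder queries freq[sym]/total before coding each symbol, and encoder and decoder apply the same update afterwards, so no table ever needs to be transmitted:

    /* Minimal order-0 adaptive model sketch: both sides start from the same
       uniform counts and apply the same update after each symbol. */
    #include <stdint.h>

    #define NSYM 256
    #define MAX_TOTAL (1u << 15)

    typedef struct {
        uint32_t freq[NSYM];
        uint32_t total;
    } Model;

    static void model_init(Model *m) {
        for (int i = 0; i < NSYM; i++) m->freq[i] = 1;  /* uniform prior */
        m->total = NSYM;
    }

    static void model_update(Model *m, int sym) {
        m->freq[sym] += 32;                              /* crude learning rate */
        m->total += 32;
        if (m->total >= MAX_TOTAL) {                     /* keep counts bounded */
            m->total = 0;
            for (int i = 0; i < NSYM; i++) {
                m->freq[i] = (m->freq[i] + 1) / 2;       /* halve, never hit zero */
                m->total += m->freq[i];
            }
        }
    }

Real adaptive rANS coders (see the fgiesen posts above) use more refined update rules, but the shape is the same: probabilities live in mutable state rather than in the header.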


I know that's what adaptive coding means, I thought that was impossible with rANS. All the descriptions of rANS I could find (back when I looked at it) described it in terms of a static model, and I ran into problems trying to generalize it to an adaptive one.

Thanks for the resources.


I think you should look at the author's blog [0]. There are lots of in-depth explanations of not just what he is doing but how he got there, the failed steps, and the intermediate steps.

[0] http://fastcompression.blogspot.in/


The plot of compression ratio against speed for the various compression levels is pretty helpful for understanding its performance: https://scontent.fsnc1-3.fna.fbcdn.net/t39.2365-6/14146892_9...

"The x-axis is a decreasing logarithmic scale in megabytes per second; the y-axis is the compression ratio achieved."

I'd love to see a version of this chart that also included Brotli. (And I'm somewhat surprised Brotli isn't mentioned at all.)

(Disclaimer: I work at Google, which made Brotli)


There's a nice comparison of compression algorithms (including zlib, zstd, brotli, snappy, etc) here: https://quixdb.github.io/squash-benchmark/ It's nice because it uses many datasets and machine platforms.

Unfortunately the graphs leave a bit to be desired, especially when you're trying to compare two algorithms. They provide the raw data, but it needs a little munging to make it workable.

For my day job I made the following graph to compare zlib level1 and zstd. This was using the "peltast" platform which is a Xeon based system, because that was most relevant for us.

http://imgur.com/a/0a3kK

I'll make one with brotli now.


Imgur album now compares brotli and zstd. Looks like zstd is always significantly faster, but Brotli compresses slightly better for every dataset except the XML one.

This is with brotli level 1, by the way. My understanding is that brotli is pretty quick through the first few levels, but the levels that ask for the highest compression are insanely slow (which is a valuable thing to have as an option, for things like game assets or something, which are compressed once and delivered many times!)


For "compress once, decompress many times" workloads like game assets, you might want to check out Oodle, a proprietary library that kills zstd in decompression speed while achieving better compression ratios.

http://cbloomrants.blogspot.com/2016/07/introducing-oodle-me...


I thought that brotli was tuned for typical web workloads, that it contained a dictionary tuned for web workloads. Our internal testing shows that it performs very poorly for binary 3D vector data.

So a test between zstd and brotli would show brotli in a poor light if it used a mixed corpus, but a test between zstd and brotli on a web corpus would give an advantage to brotli...


Not just tuned for web workloads in general, but for specific web workloads. The Brotli dictionary is mostly composed of English words and phrases, and fragments of HTML, CSS, and Javascript. It would perform poorly on non-English text.

I have a feeling that the dictionary was designed with the specific goal of performing well on a specific corpus similar to the Large Text Compression Benchmark[1]. It has quite a few words and phrases that I'd associate with Wikipedia's "house style".

[1]: https://cs.fit.edu/~mmahoney/compression/textdata.html


The dictionary also contains lots of Chinese, Russian and Arabic


By the numbers:

- 9216 phrases total

- 5857 (63.5%) pure ASCII phrases, mostly English with a few Spanish words thrown in

- 1372 (14.8%) code fragments -- mostly HTML, CSS, and Javascript

- 1027 (11.1%) CJK (Chinese, Japanese, and Korean, and mostly the first two) phrases -- it's very hard to tell Chinese and Japanese apart in this context; I didn't try

- 158 (1.7%) phrases containing extended Latin-1 characters (nearly all Spanish words)

- 303 (3.3%) Cyrillic script (probably Russian) phrases

- 322 (3.5%) Arabic phrases

- 172 (1.9%) Devanagari script (Hindi) phrases

Plus a few miscellaneous other scripts and generally unclassifiable content.


'tuned' for the brotli dictionary is a bit generous [1], but yes, it contains a smattering of strings [2] you'd find in plain text and web documents.

Because of its seemingly haphazard dictionary, I wouldn't rule out zstd outperforming brotli, if trained on a good dataset.

[1] https://news.ycombinator.com/item?id=12010313

[2] https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e


>Our internal testing shows that it performs very poorly for binary 3D vector data.

Uh ? What format of data was this ?

I did a pretty large test (2gb+) on OBJ/STL 3d data and brotli compressed within ~5% margin of lzma, and this holds true on other binary data I've compared.

It also compressed better than zstd (as in compression ratio) on the same data at their highest respective compression settings:

    bro -quality 11 -window 24

    zstd --ultra -22

So I find it baffling that you find brotli to be poor for any binary data, could you share the data in question ?


What is the compression time for brotli to achieve compression ratios comparable to LZMA on these types of data sets: https://github.com/google/brotli/issues/165

I understand that brotli is incredibly slow as compared to LZMA when using these high ratio settings (q11, w24), so slow as to be impractical in production even in a write-once, read-many scenario if you have any non-trivial amount of data being regularly produced. I do not want to have a farm of machines just to handle the brotli compression load of our data sets just because it is 10x slower than LZMA.


>What is the compression time for brotli to achieve compression ratios comparable to LZMA...

Compression time is indeed the achilles heel of Brotli, which is why it's something I would only use for compress once (and preferably decompress very often) scenarios.

Compared to lzma at its best compression setting for this particular data (-mx9 -m0=LZMA:d512m:fb273:lc8), brotli took 6 minutes and 4 seconds to compress, while the same data took 1 minute and 47 seconds with lzma.

On the other hand, brotli decompressed the same data in 0.6 seconds, while it took lzma 2.2 seconds.


Brotli is available in this graph - https://github.com/facebook/zstd/blob/master/images/DCspeed5...

This seems to be a slightly older doc, from April.


The README at https://github.com/facebook/zstd mentions brotli



Would be especially interesting to have zstd work with brotli's dictionary for the comparison.


Note: this is from the same guy who created the popular lz4 compressor, Yann Collet: http://cyan4973.github.io/lz4/

https://twitter.com/Cyan4973


Yann will be giving a talk on Zstandard at today's @Scale 2016 conference, and the video will be posted. He can answer the most technical questions about Zstandard, but I may be able to answer some as well; we both work on compression at Facebook.


I am really looking forward to this. I usually prefer reading over videos, but for a complicated topic with a good presenter a video can actually be a comprehensive starting point. Will the video be posted today, or will we have to wait?

One thing I haven't figured out from either today's post or Yann's blog is whether Zstandard switches between huff0 and FSE depending on compression level, or somehow uses both together. Also, the post says it's both OoO-friendly and multi-core friendly, but are the speed benchmarks in a single-core context or multi-core? Is only the format/algorithm multi-core friendly, or can the standard CLI run multi-threaded?


I'm not sure when videos will be posted -- hopefully soon.

All benchmarks today are single threaded. The algorithm itself is single threaded, but can be parallelized across cores. We will soon release a pzstd command line utility to demonstrate this, similar to pigz, which accelerates both compression and decompression.

Zstandard uses both huff0 and FSE together when it compresses -- it doesn't switch between them based on the input.



Hey quick question, and sounds awesome.

If I wanted to use this in the pipeline of my server's journaling system, is there any requirement that I restart the stream periodically?

That is to say, should I use it per journal entry (probably not a good idea for short messages), for the entire uptime of the writer, or with periodic restarts?

Obviously can measure and find out for myself, but wondered if you had any thoughts. Thanks!


The modern trend of compressors is to use more memory to achieve speed. This is good if you're using big-iron cloud computers...

"Zstandard has no inherent limit and can address terabytes of memory (although it rarely does). For example, the lower of the 22 levels use 1 MB or less. For compatibility with a broad range of receiving systems, where memory may be limited, it is recommended to limit memory usage to 8 MB. This is a tuning recommendation, though, not a compression format limitation."

8MB for the smallest preset? Back in the mid-2000s, I was attending a Jabber/XMPP discussion, about the viability of using libz for compressing the stream. It turned out that even just a 32kb window is huge when your connection server is handling thousands of connections at a time, and they were investigating the effect of using a modified libz with an even smaller window (it was hard-coded, back then).

I know Moore's law is in ZStandard's favor w.r.t. memory usage (what's 8MB when your server's got 64GB or more?), but I think it's useful to note that this is squarely aimed at web traffic backed by beefy servers.


Any modern server that handles a thousand or more concurrent connections on commodity hardware already uses only about as many threads as there are processor cores. In that architecture it's trivial to also limit the number of compression threads to the number of processor cores. That architecture gives the best performance and very low memory use.

In the mid-2000s it was still the accepted norm to spawn one thread for each connection, where the memory usage of the compressor would have been a problem. I doubt that it's a problem with today's software architecture.


A server like this could only work by buffering an entire response before compressing it once, requiring (compressed + uncompressed) bytes of temporary space. In reality, most servers of the design you mention operate in a streaming fashion, flushing the compressor's output just in time as the backend fills its input buffer. In designs like that (most of them) a compression context per connection is still required.


Not sure I agree. The 8 MB is the recommended upper limit, so I don't think anyone is planning to use that for web traffic. I think it's designed to be faster and compress better even at lower window sizes, though I'm not sure how low. It most likely performs better than zlib even at 32 KB, or is at least faster, I would assume. Now if you are a Jabber/chat server holding thousands of long-running connections, it can be an issue; you already said how even standard zlib doesn't work well there.


I don't think 8MB is the smallest preset, the text you quoted says that the lower levels use "1 MB or less".

The concern I have is that this makes it sound like the compressor can choose how much memory the decompressor will need to use. Does this mean that zstd can't be used in a potentially adversarial environment? (Eg. is there a denial-of-service vector here by forcing the server to use large amounts of memory to decompress my requests?)


The decompressor can receive a parameter to refuse compressed frames requiring more than a selected amount of memory.
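
In current releases of the library this shows up as a decompression-context parameter; a sketch of what that might look like (the exact API has changed since the 1.0 era, so treat the call names as belonging to newer zstd versions):

    /* Sketch: cap the window size the decompressor will accept, so a hostile
       frame cannot force a huge allocation.  Uses the advanced API available
       in current zstd releases; the 1.0-era API differed. */
    #include <zstd.h>

    size_t decompress_capped(void *dst, size_t dstCap, const void *src, size_t srcSize) {
        ZSTD_DCtx *dctx = ZSTD_createDCtx();
        /* refuse frames whose window exceeds 2^23 = 8 MB */
        ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 23);
        size_t ret = ZSTD_decompressDCtx(dctx, dst, dstCap, src, srcSize);
        ZSTD_freeDCtx(dctx);
        return ret;   /* check with ZSTD_isError() */
    }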


It will not use (much) more memory than the size of the output in any case. 8MB is the window here, which just means the decompressor can discard data that falls outside this 8MB window as it is decompressing.


Minix 1.5 for 8086 had a slightly mad decompress program that would fork itself up to 4 times in order to address enough memory to decompress certain *.Z files:

https://github.com/jbruchon/elks/blob/master/elkscmd/minix1/...


Didn't know there were still people working on elks...


You can get pretty far with "amnesiac" zlib for networking, too. You collect up writes in your out-buffer, and use zlib to compress it before transmission. The trick is that you don't retain context or find matches between chunks, so there's no memory overhead between sends.
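
A minimal sketch of that approach with zlib's one-shot API (send_bytes is a hypothetical I/O helper, and error handling is omitted):

    /* "Amnesiac" compression: each outgoing buffer is compressed independently,
       so no per-connection compression context (and no window memory) is
       retained between sends. */
    #include <zlib.h>
    #include <stdlib.h>

    void send_bytes(int fd, const unsigned char *buf, size_t len);  /* hypothetical */

    void send_compressed_chunk(int fd, const unsigned char *chunk, size_t len) {
        uLongf out_len = compressBound(len);
        unsigned char *out = malloc(out_len);
        compress2(out, &out_len, chunk, len, Z_BEST_SPEED);
        send_bytes(fd, out, (size_t)out_len);  /* receiver inflates each chunk on its own */
        free(out);
    }

The receiver can inflate each chunk independently; the price is a worse ratio, since matches across chunks are never found.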


> what's 8MB when your server's got 64GB

Note that 8 MB is the size of the L3 cache on some modern Intel chips. You want fast lookups in the window area, where you constantly do random-access reads.


use slz ( http://1wt.eu/projects/libslz/ ) for web traffic and non-beefy servers.


I'm a complete dunce when it comes to compression and how it fits in the industry, so help me out here. Say that everyone accepts that Zstandard is amazing and we should start using it. What would the adoption process look like? I understand individual programs could implement it since they would handle both compression and decompression, but what about the web?

Would HTTP servers first have to add support, then browser vendors would follow?


>what about the web

The browser sends the server a request header indicating which compression methods it understands. Current Firefox for example sends

    Accept-Encoding: gzip, deflate, br
meaning the server is free to send a response compressed with either gzip, deflate or brotli. Or the server can choose to send the data uncompressed.

This means the adoption path for the web would be an implementation in at least one major browser, which advertises this capability with the Accept-Encoding header. Then any server can start using Zstandard for clients accepting it.
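
If a browser ever advertised zstd, the exchange might look like this (the "zstd" content-coding token here is hypothetical, not something browsers send today):

    GET /index.html HTTP/1.1
    Accept-Encoding: gzip, deflate, br, zstd

    HTTP/1.1 200 OK
    Content-Encoding: zstd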


Nice, so really browsers and servers can implement it independently. Seems elegant, I like that browsers can advertise multiple compression formats. Thanks for the info!


It's also possible to implement a decompressor in javascript to support browsers which don't do it natively. The performance would likely suck but if you're truly bandwidth constrained and don't mind users having a bit of a lag, it's an option...


Assuming the size of the decompressor isn't larger than the savings you gained from using this compression algo over another...


Well the decompressor would be cached by the browser so it would pay off for repeat visitors. And if it gets used enough, the user will already have downloaded the decompressor from another site.


And easily shared across sites if they all use a popular CDN'd version.


Some more benchmarks on this[0] page

Also, I actually discovered something very interesting (to me at least). At the bottom of the link mentioned below, the link attached says https://github.com/Cyan4973/zstd but then redirects to https://github.com/facebook/zstd . Anyone know why?

[0]: http://facebook.github.io/zstd/

EDIT: After a little bit of sleuthing, it looks like the author of zstd (github.com/Cyan4973) is now contributing[1] to github.com/facebook/zstd

And the page layout for lz4[2] looks the same as zstd[0]

Anyone know if Yann Collet works for/with facebook on things other than zstd?

EDIT 2: In the time it took me to google a couple things, looks like the children comments have already answered my questions.

Also, previous discussions on zstd (not that it's completely relevant): https://news.ycombinator.com/item?id=8941955 https://www.reddit.com/r/programming/comments/2tibrh/zstd_a_...

[1]:https://github.com/facebook/zstd/pull/312 [2]: http://cyan4973.github.io/lz4/


> the link attached says https://github.com/Cyan4973/zstd but then redirects to https://github.com/facebook/zstd . Anyone know why?

I believe this is just GitHub's standard behavior when a repository is moved to another namespace. We recently renamed https://github.com/D-Programming-Language to https://github.com/dlang , and e.g. https://github.com/D-Programming-Language/dmd redirects in the same way.


Oh, D language is dlang now? Looks like you guys did a redesign of the website and stuff too! I hear good things about D, keep up the great work!


Thanks! To clarify, we didn't rename the language, just the GitHub organization. However, dlang.org is what we've been using as the canonical address for the website, and "dlang" is a much more searchable term than just "D". (Now, if only Google would stop auto-correcting "dlang" to "golang"...)


It hasn't been long that I fetched zstd from https://github.com/Cyan4973/zstd

Upon looking further, it turns out that the author Yann Collet [0] works with Facebook now; the repo would have been thus transferred to Facebook. Author's blog [1].

[0] https://twitter.com/Cyan4973

[1] http://fastcompression.blogspot.in/


>It hasn't been long that I fetched zstd from https://github.com/Cyan4973/zstd

Yep, and Google too

https://www.google.co.in/search?q=zstd

> it turns out that the author Yann Collet [0] works with Facebook now

yep https://www.facebook.com/yann.collet.73

On another note, I wonder what happens when personal and professional lives interfere. The profile of Yann Collet in the blog post is the above link, and I can't help but think: why am I on Facebook looking at a baby's picture on someone's profile instead of GitHub or at least LinkedIn? Seems like something he might want to keep private (and yes, I'm totally assuming here, I don't know his preferences).

Now, I know you can restrict the stuff you post to friends only instead of public like he did, but it is still something people (and Facebook, for its employees) should consider if they want their Facebook profile to become their professional contact page.


Cyan4973 was hired by Facebook. He is the famous author of LZ4. They likely decided to move the project they sponsor into their organization, most likely for PR.


Presumably the ownership of the repo was transferred from Cyan4973, the main contributor, to Facebook, and GitHub automatically redirects.


Yep, looks like it. Maybe they just forgot to update the gh-pages branch.


Downvoted? Wait, I don't think I'm getting that wrong. They probably would've preferred to avoid the outdated links, as this [0] shows.

[0]:https://github.com/facebook/zstd/pull/313


The goals sound similar to Apple's LZFSE (see https://github.com/lzfse/lzfse for more). Any comparison out there?


That was my first thought, too. I installed and ran both against a tar'd set of PDF files totaling 435MB in size. My timings:

    lzfse  45 MB/s encode, 229 MB/s decode, 1.12 comp ratio
    zstd  181 MB/s encode, 713 MB/s decode, 1.13 comp ratio
The numbers are so dramatically different that I ran several different tests, but they all showed roughly the same results. I used default command-line options for both tools, and both created very similar compression ratios.

Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second. Zstandard, on the other hand, runs multiple pipelines and such--it's banking on having a server-class processor to run on.

Edit: hardware is a 2013 MacBook Pro, pretty fast flash storage, and 2 cores/4 threads. I warmed cache before each run and sent output to /dev/null, so the numbers above are best-case.
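
If anyone wants to reproduce something similar without worrying about I/O, zstd's CLI has a built-in benchmark mode; assuming the -b/-e flags behave as in current releases, a sweep over levels 1 through 19 of a (hypothetical) test archive would be:

    zstd -b1 -e19 pdfs.tar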


"Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second."

Sure, but is that even relevant? I mean, is there any way that lzfse could possibly be more power-efficient per byte than zstd when zstd is 3-4 times faster for the same compression ratio? According to the docs zstd doesn't have any support for multiple threads right now, so it should be a fair comparison.


I use "fastest to complete == least power usage" as a rule of thumb because of "race to sleep". I suppose that might be thrown off by power usage characteristics varying based on number of cores working? How does one even begin to write code that prioritizes power-efficiency over performance?


For short runs, any fixed-cost warmup/cooldown periods might dominate - a latency/throughput tradeoff kind of affair. (I am just shooting my mouth off and have no idea whether that's the case here.)

As for how to do it, I'm not sure... but if I were a valued customer of various CPU suppliers, and were famous for the depth of my pockets, I'm sure I'd be able to find somebody to explain it to me ;)


Can you repeat those tests using the library shipping with Mac OS X?

I'm asking because Apple claims the github code is a reference implementation. Those typically aren't tuned for speed or efficiency.


Fast == power efficient, though.

Not surprising. Apple-originated tech is always kind of middle of the road.


Apple's goals were also to have a low-energy de/compressor suitable for mobile. I'd love to see some comparisons of the two of them running on ARM.


Is low energy significantly different from high performance code? I thought the general advice was make the code as fast as possible so the work can be finished and the CPU slow/power down again.

Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?

(Memory use/access is the main thing I guess that could be different.)


>> Is low energy significantly different from high performance code?

It can be. Certain operations are more power efficient than others - subtraction and then a check for negative value instead of comparison, certain vector operations, loop unrolling, using less memory/bus traffic...

>> Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?

Slower code can be more power efficient. You're just tuning for different results.


Interesting. Do you know of any open source examples offhand?

I'd imagine this also requires very detailed knowledge of the chip, microcode, etc. Probably hard for x86 but I guess this kinda stuff would be on ARM where the programmer has deep vertical access (like as mentioned, Apple).


I'd love to see that too. Remember that if the processor uses 2x the power but gets done 3x faster, it's still 1.5x more efficient overall.


That's what I was thinking. I haven't found any validation of, or even the rationale behind, LZFSE's supposed lower power usage. I can think of two possibilities.

1. Apple tried to create a fast and reasonably well-compressing successor to LZ4, thus improving power usage, by writing LZFSE since none existed, but it has been beaten out handsomely by Zstandard.

2. Following a parent comment, Zstandard might do some things that depend on a highly OoO CPU with lots of cache and an extremely good branch predictor, which could be significantly slower on an ARM chip, even Apple's, despite how good they are. Or they could still be slower, but on ARM the gap might not be as big and the decision not as cut and dried as it seems now.

Would love to know what the actual case is from someone involved in LZFSE.


From the bits of testing I've done today, it's phenomenally fast on x86. Much better than gzip (and pigz for that matter) in every metric I think I generally care about: CPU Usage, Compression Speed, Decompression Speed, Compression Ratio.

On other architectures the picture gets a bit murky; it seems to get handily beaten by pigz through what at first blush I'd guess is just sheer parallelism. It's got solid performance, and is without a shadow of a doubt faster than vanilla gzip. If/as/when I get time, it'll be interesting to dig into why performance is worse there.


Dug in a bit further. On the non-x86 architecture I use, it looks like it's really just straight core performance that explains it. pigz's only advantage there really seems to be the brute force parallelism.

In particular note the huge difference in branches between gzip and zstd on decompress:

    8959780663 branches        # 143.024 M/sec
    2969481781 branches        #  64.454 M/sec

and on misses:

    542158823 branch-misses    # 6.05% of all branches
     89060880 branch-misses    # 3.00% of all branches


There is now a pzstd implementation, if you wish to compare it to pigz


Had a shot. It's slightly buggy, but holy crap is it fast.

I'm not a C programmer, so understanding what happened is a bit beyond me, but: 1) to compile on Linux it needs the -pthread flag passed to it, and the Makefile is missing that (it compiles fine on OS X); 2) decompression over stdin appears to be effectively impossible, it still demands an input file. Compression over stdin works fine.


This is an awesome blog post that is very well written, but the lack of incompressible performance analysis prevents it from providing a complete overview of zstd.

Incompressible performance measurements are important for interactive/realtime workloads and the numbers are extremely interesting because they can differ dramatically from the average case measurements. LZ4 for instance has been measured at doing 10GB/sec on incompressible data on a single core of a modern Intel Xeon processor. At the other end of the spectrum is the worst case scenario for incompressible data where performance slows to a crawl. I do not recall any examples in this area, but the point is that it is possible for algorithms to have great average case performance and terrible worst case performance. Quick sort is probably the most famous example of that concept.

I have no reason to suspect that zstd has bad incompressible performance, but the omission of incompressible performance numbers is unfortunate.


zstd goes at > 1 GB/s on uncompressible data. It has some fast heuristics for such cases too.


A recent compression discussion I saw involved how compressors fare on incompressible input. For example, suppose you wanted to add compression to all your outbound network traffic. What would happen if there was compressible traffic mixed with the incompressible kind? A common case would be sending HTML along with JPEG.

Good compressors can't squeeze any more out of a JPEG, but they can back off fast and go faster. Snappy was designed to do this, and even implementations of gzip do it too. It greatly reduces the fear of CPU overhead for always-on compression. I wonder how Zstd handles such cases?

*Ignoring security altogether
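
One generic way to get the "back off fast" behavior is to cheaply test a sample before committing to full compression. A rough sketch of such a heuristic follows; this is not how Snappy, gzip, or zstd actually detect incompressible input, just an illustration of the idea:

    /* Estimate the order-0 entropy of a small sample and skip compression
       when the data already looks close to random. */
    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    static double sample_entropy_bits_per_byte(const uint8_t *buf, size_t n) {
        size_t hist[256] = { 0 };
        for (size_t i = 0; i < n; i++) hist[buf[i]]++;
        double h = 0.0;
        for (int b = 0; b < 256; b++) {
            if (!hist[b]) continue;
            double p = (double)hist[b] / (double)n;
            h -= p * log2(p);
        }
        return h;   /* 8.0 means the sample looks like random bytes */
    }

    static int looks_incompressible(const uint8_t *buf, size_t n) {
        size_t sample = n < 4096 ? n : 4096;
        return sample_entropy_bits_per_byte(buf, sample) > 7.5;
    }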


Zstd passes faster over incompressible data. Expect something > 1 GB/s.


Dropbox say they can losslessly compress JPEG files (~20% saving):

https://blogs.dropbox.com/tech/2016/07/lepton-image-compress...


Known techniques. JPEG uses Huffman coding for entropy coding, you can replace that with arithmetic coding. This requires knowing the format details, though.


>> "It is written in highly portable C, making it suitable for practically every platform used today"

I love C, it is not the enemy everyone makes it out to be.

It's already in Debian: https://packages.debian.org/stretch/zstd and judging by the small requirements, it is portable indeed.


Quick benchmark on a 194MiB SQL dump:

    gzip -9: 27.574s, 48MiB output
    zstd -9: 14.182s, 41MiB output
Thanks, I'll gladly use zstd as a drop-in replacement for my daily backups. :)
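
For anyone doing the same swap, a drop-in backup pipeline might look like this (hypothetical database name; zstd -dc restores the stream):

    mysqldump mydb | zstd -9 > backup.sql.zst
    zstd -dc backup.sql.zst | mysql mydb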


Isn't it a bit presumptuous to call your own thing "standard"?


Like these guys: https://github.com/feross/standard, which is permissive to the point it's completely pointless. You can follow it to the letter and still end up with vomit code.


Really nice work compared to what I consider to be the quite bad Brotli -- an incredibly slow compression standard that only ended up in browsers because it was created by Google.


except for the PATENTS file =/


Curious what your issue is with it -- it basically says "if you don't sue us, we won't sue you". That's about as good as I can expect from a large tech company these days with regards to patents.


On a cursory glance, the VP9 patent license is much nicer: it only gets voided if you sue about VP9.

That being said: has anyone actually given any indication that there are any relevant patents in the first place?


But doesn't that mean that Facebook can use any of your patents and you cannot sue them unless you stop using zstd?


Compare it to the Opus patent license (also including a retaliation clause), which includes grants from Broadcom, Mozilla and Microsoft.

Using zstd gives Facebook a free license on ALL your patents.

Using Opus gives Facebook only a license on patents that apply to Opus.

So no, large tech companies can and have given MUCH better grants for compression tech, than Facebook is doing.


If Facebook wanted zstd in the browsers, which would make sense for them to be able to reduce bandwidth and improve performance, the patent grant seems to make that impossible: Google would never put zstd in Chrome with such a clause.


I don't understand these things well, but if true this would be really bad. LZ4 is everywhere because it was completely free, and I think most of Zstandard's real work was done pre-Facebook by Yann alone. Now to have its hands tied because of his job at Facebook is the worst thing possible.

I just read the Opus patent summary. It seems like if zstd followed the same license, using it wouldn't give Facebook any license, but if I sue Facebook I lose the license to use zstd. Am I correct in that?


> Now to have its hands tied because of his job at Facebook is the worst thing possible.

If you accept the job you give up your work. It's not like there's no compensation.

> Am I correct in that?

Yes, the retaliation (and hence, implicit license to Facebook) is only for the technology itself.


Well, I suppose so, provided that you hold patents and want to use them offensively.

Because as far as I can see, Facebook has an exception saying that if they for some reason patent-sue you first, you would be allowed to countersue without losing your license to their patents.

That could be a misinterpretation, I suppose. But if it's right, then Facebook's license seems superior for those parties who wish to end software patents.


> Facebook's license seems superior for those parties who wish to end software patents

It's mostly superior for Facebook. If you're a party that wants to end software patents, but intends to use zstd in any place you might want to interface with a company that doesn't hold the same position, then you're screwed.

If you want to end software patents, and Facebook sues with one (not applying to zstd), you're also still screwed. This is relevant because even if you're against software patents, you can take them out defensively. But this license makes that useless.

Compare to GPL vs LGPL, or how the free software codecs all eventually moved to BSD.


I think for typical JS/CSS/HTML sizes and decompression times, maximum compression ratio followed by decompression speed is probably what I'd look for. I don't care too much about compression speed, in the sense that if I have to spend 1 minute compressing JS to crunch it by 10%, but I serve that file a million times, then as long as decompression doesn't negate the gain in network time saved, it's a win.

I guess the other factor for mobile is, besides memory and decompression speed, how do various compression schemes fare battery wise?


Regarding decompression times, I think it's much more important to save transferred network data than to prioritize quicker decompression.

HTTP request times have a long tail; they can get really slow for the many, many people limited to slow connections. Decompression times are going to be much more consistent. Our aim here should be to improve the 10% slowest requests, and you do that by optimizing the actual transfer.


For this to become anything like a standard, Facebook would have to remove its patent poison pill.


If anyone wants to try it on windows there is a 7-zip install with support for ZSTD

https://mcmilk.de/projects/7-Zip-zstd/


If Facebook hopes the new compression algorithm will become a standard, why doesn't it publish an IETF RFC draft? Will it follow OpenDNS's way with DNSCrypt, open-sourcing the reference implementation without publishing any IETF RFC draft?


The following link points to a fairly good benchmark / tool that showcases the tradeoffs in real life: since (de)compression takes time, what is the fastest way to transmit data at a given transfer speed?

https://quixdb.github.io/squash-benchmark/unstable/#transfer...

Spoilers: zstd wins at ethernet and wifi (and is among the best in 4G), lz4 wins at hard drive encryption… both were designed by the same author.


Charles Bloom's new Oodle codecs handily beat them across the scale:

http://cbloomrants.blogspot.com http://www.cbloom.com/oodle_arm_report/ http://www.cbloom.com/rants.html

Caveat: proprietary commercial product.


I was looking for a Windows version of zstd. On the GitHub page, I could only get version 0.8.1 of the Windows tool. Can someone release the 1.0 version of zstd for Windows?


How difficult is this new standard going to be to implement in another language? It seems highly sophisticated -- which is great, of course -- but the cost of that is relying on giants like Facebook to maintain their One True Implementation. For software this is (usually) fine; for a new standard, it's a problem.


The format itself is documented (https://github.com/facebook/zstd/blob/master/zstd_compressio...) with the intention of other implementations and language bindings being readily available. We also have a zlib-compatible API for easier porting to applications already using Zlib. Our hope is that Zstandard is both easy to use and easy to contribute to.
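
For a sense of how small the surface area is for bindings, a round trip through the one-shot C API looks roughly like this (a sketch with minimal error handling, not the complete recommended usage):

    #include <zstd.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        const char *src = "some data that probably compresses";
        size_t src_size = strlen(src) + 1;

        size_t bound = ZSTD_compressBound(src_size);        /* worst-case output size */
        void *comp = malloc(bound);
        size_t comp_size = ZSTD_compress(comp, bound, src, src_size, 3 /* level */);
        if (ZSTD_isError(comp_size)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(comp_size)); return 1; }

        void *out = malloc(src_size);   /* in real code, read the size from the frame header */
        size_t out_size = ZSTD_decompress(out, src_size, comp, comp_size);
        if (ZSTD_isError(out_size)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(out_size)); return 1; }

        printf("%zu -> %zu -> %zu bytes\n", src_size, comp_size, out_size);
        free(comp); free(out);
        return 0;
    }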


The majority of the work was done by a single person, though. I think by the time Facebook joined, the majority of the design was done. Granted, he doesn't seem to be a normal person, so it doesn't matter that way.

But for compression, wouldn't the majority of languages actually use bindings rather than implementing it natively, for performance, making this moot?

I would assume the road to finding the optimal solution and understanding why it worked is much more complicated than the actual code. And a quick look doesn't suggest it's a very big code base for the library itself.


That's truly beautiful. Thanks, Facebook! I particularly love that you can pre-compute and reuse dictionaries, say if you're regularly compressing similar JSON objects.
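
For anyone curious, the dictionary workflow from the CLI is roughly the following (file names are hypothetical):

    zstd --train samples/*.json -o json.dict     # build a dictionary from sample records
    zstd -D json.dict record.json                # compress a single record with it
    zstd -D json.dict -d record.json.zst         # the same dictionary is needed to decompress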


Zstandard is both a command line tool (zstd) and a library.


I'd love HTTP content encoding to support this and see a comparison to Brotli. Looks like it might be yet another good alternative to gzip.


Is this being pushed as a standard as part of the HTTP spec, seeing that it comes from Facebook?


Just about anything is better than brotli.


TurboHF claims to be 4x faster than zlib's Huffman coding and 2x faster than FSE, and is a generic CPU implementation. Even if the claims are only partially true, and TurboHF were a clean drop-in replacement for zlib and the licensing were friendly, the appeal of zstd would drop substantially in my book.

http://encode.ru/threads/2276-TurboHF-1GB-s-Huffman-Coding-R...

https://sites.google.com/site/powturbo/entropy-coder


Unless they integrate it into software like web servers and web browsers it will be hard to see it really flourish as a "standard".

But at least within the perimeter of your own systems you can totally profit from this technology now.


Beautiful.


Tables should be in HTML.


Looks very interesting, however I'm not impressed by the name. "Zstandard"??? With ".zstd" as the extension? I don't like it.

They should have named it letter-zip, along the lines of gzip, bzip, and xzip, with the extension letterz. "fz" would have been a good one since they work at Facebook.



This isn't bikeshedding. Bikeshedding is about quibbling over unimportant details. Names are critically and absolutely important. Lots of great things have been hobbled or ruined by poorly-chosen names. A terrible name can cause something worthy to be ignored in favor of something inferior but with a better name. And you don't need to even be competent in the inner workings of a project to criticize its name or suggest better ones. The people who name cars aren't the same people who design the engine-control algorithms for them.

If you disagree, what do you think of naming your kid "Adolph Hitler [lastname]"? Most people agree that a name like that will cause great harm to a child growing up because of the constant ridicule and ostracization he'd inevitably face. That's an extreme example, but names are important.

Note that I don't think this project's name is horrible, but I don't think it's very good either, and could be a lot better.



It used to be called just 'zstd' but I guess that wasn't very pronounceable.


Zesty!


Zed-Stud


...yeah, why didn't they claim part of a namespace that only has room for 26 (or 36) things? Everyone else is doing it!


I don't see the problem: only 3 members of that namespace are currently claimed (4 out of the 36-member namespace: 7z), so we have room for 23 (or 32) more compression standards before running out. We've been using gzip for, what, 20 years now? Only recently have we gotten xz. At this rate, we won't run out of compression standards using this scheme for roughly 153 years. And after that, we could always start using capital Zs, like the old "compress" standard that used the .Z extension. Or we could go to a 3-letter extension ending in z, such as ".fbz", which gives you 676 more options, and 4507 years. Considering that general-purpose data compression really hasn't moved that much since the DEFLATE algorithm took over, and only recently had any real change with the advent of LZMA (used in p7zip and xz), and perhaps this new zstd (too soon to tell), my time estimates here are probably too short.


Perhaps we should require the less common algorithms to have a longer prefix.


There's no way to know what's going to be a common algorithm in the future unless you have a time machine. When DEFLATE was first invented, it wasn't common either, it was brand-new. Now it's everywhere. This new algorithm might become just as ubiquitous in 10 years, or it might turn into the next bzip2, or worse, the next ZOO.


You're right, of course. So we should base it on how common the algorithm was in the compressions we've done so far.

(I am not sure if i need to clarify that my original comment was meant to be a joke about Huffman coding)


Agreed. How did they not call it "Pied Piper"?


I was waiting for that reference.


[flagged]


Please don't do this, comments on HN need to be substantive.


How does this compete with PiedPiper?


This is general purpose whereas Pied Piper, Hooli et al. focussed on video.


Awesome


Best middle-out in the game.


Whats their weissman score?


I bet it doesn't even touch pied piper's.


Probably nitpicking, but "smaller data compression" makes no sense really.


Should we start with the pied piper jokes now or later?


Not until I know the Weissman Score of Zstandard.


Should've named it "Pied Piper"


They should've named it "Pied Piper".



