Not just tuned for web workloads in general, but for specific web workloads. The Brotli dictionary is mostly composed of English words and phrases, and fragments of HTML, CSS, and JavaScript. It would perform poorly on non-English text.
I have a feeling that the dictionary was designed with the goal of performing well on a particular corpus similar to the Large Text Compression Benchmark[1]. It has quite a few words and phrases that I'd associate with Wikipedia's "house style".
- 1027 (11.1%) CJK phrases (Chinese, Japanese, and Korean, mostly the first two) -- it's very hard to tell Chinese and Japanese apart in this context, so I didn't try
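For anyone who wants to reproduce a count like this, here is a rough sketch of one way to bucket dictionary entries by script using only Python's standard library. It is not necessarily how the figure above was produced. The names `script_of` and `classify` and the `sample` list are made up for illustration, and getting the entries out of the Brotli dictionary is assumed to happen elsewhere. It also shows why Chinese and Japanese are hard to separate: an entry made only of Han ideographs could be either, so it lands in an ambiguous "han-only" bucket.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Coarse script label for one character, based on its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    return "other"


def classify(entry):
    """Heuristic label for one dictionary entry given as raw bytes."""
    try:
        text = entry.decode("utf-8")
    except UnicodeDecodeError:
        return "undecodable"  # fragment that is not valid UTF-8 on its own
    scripts = {script_of(ch) for ch in text}
    if "kana" in scripts:
        return "japanese"  # kana is used only in Japanese
    if "hangul" in scripts:
        return "korean"
    if "han" in scripts:
        return "han-only"  # Han ideographs alone: Chinese or Japanese
    return "other"


# Hypothetical sample entries; a real run would use the extracted dictionary.
sample = [
    b"the ",
    b"<div class=",
    "中国".encode("utf-8"),
    "日本語".encode("utf-8"),
    "ください".encode("utf-8"),
    "한국".encode("utf-8"),
]

tally = Counter(classify(e) for e in sample)
cjk = tally["japanese"] + tally["korean"] + tally["han-only"]
cjk_percent = 100.0 * cjk / len(sample)
print(tally, f"CJK: {cjk} ({cjk_percent:.1f}%)")
```

Note that "日本語" (the Japanese word for "Japanese") still ends up as "han-only" because it contains no kana, which is exactly the ambiguity mentioned above.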
[1]: https://cs.fit.edu/~mmahoney/compression/textdata.html