
Not just tuned for web workloads in general, but for specific web workloads. The Brotli dictionary is mostly composed of English words and phrases, and fragments of HTML, CSS, and JavaScript. It would perform poorly on non-English text.

I have a feeling that the dictionary was designed to perform well on one particular corpus, something similar to the Large Text Compression Benchmark[1]. It has quite a few words and phrases that I'd associate with Wikipedia's "house style".

[1]: https://cs.fit.edu/~mmahoney/compression/textdata.html




The dictionary also contains lots of Chinese, Russian, and Arabic.


By the numbers:

- 9216 phrases total

- 5857 (63.6%) pure ASCII phrases, mostly English with a few Spanish words thrown in

- 1372 (14.9%) code fragments -- mostly HTML, CSS, and JavaScript

- 1027 (11.1%) CJK phrases (Chinese, Japanese, and Korean; mostly the first two) -- it's very hard to tell Chinese and Japanese apart in this context, so I didn't try

- 158 (1.7%) phrases containing extended Latin-1 characters (nearly all Spanish words)

- 303 (3.3%) Cyrillic script (probably Russian) phrases

- 322 (3.5%) Arabic phrases

- 172 (1.9%) Devanagari script (Hindi) phrases

Plus a few miscellaneous other scripts and generally unclassifiable content.
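
For anyone who wants to reproduce a breakdown like this, here is a rough sketch of one way to bucket phrases by script. It is not necessarily how the counts above were produced: the category names, the code point ranges, and the punctuation heuristic for spotting code fragments are all assumptions, and `phrases` is a hypothetical list of already-decoded dictionary entries (extracting them from the Brotli dictionary is not shown).

```python
from collections import Counter

# Assumed Unicode block ranges per category (hypothetical, deliberately coarse).
SCRIPT_RANGES = [
    ("cjk", 0x3040, 0x30FF),         # Hiragana and Katakana
    ("cjk", 0x4E00, 0x9FFF),         # CJK Unified Ideographs
    ("cjk", 0xAC00, 0xD7AF),         # Hangul syllables
    ("latin1", 0x00A0, 0x00FF),      # Latin-1 Supplement (accented letters)
    ("cyrillic", 0x0400, 0x04FF),
    ("arabic", 0x0600, 0x06FF),
    ("devanagari", 0x0900, 0x097F),
]

# Assumed heuristic: ASCII phrases containing any of these look like markup or code.
CODE_CHARS = set('<>{}=;/"')


def classify(phrase):
    """Return a category based on the first non-ASCII character, if any."""
    for ch in phrase:
        cp = ord(ch)
        if cp < 0x80:
            continue
        for name, lo, hi in SCRIPT_RANGES:
            if lo <= cp <= hi:
                return name
        return "other"  # non-ASCII, but outside every listed range
    # Only ASCII characters were seen (or the phrase is empty).
    if any(ch in CODE_CHARS for ch in phrase):
        return "code"
    return "ascii"


def tally(phrases):
    """Map each category to (count, percentage of all phrases, one decimal)."""
    counts = Counter(classify(p) for p in phrases)
    total = sum(counts.values())
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}
```

Calling `tally(phrases)` gives a dict of counts and percentages per category. Telling Chinese from Japanese, or English from ASCII-only Spanish, would need actual language detection, which this sketch does not attempt.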



