It should be obvious that this is junk quality as a language model. But it's a cool example of the equivalence between compression codes and probability distributions, so I hope people find it interesting for that reason. You can also get a bit of an intuitive sense for the patterns that gzip and bzip2 pick up on in text (they like repetitive strings).
I think it is a cool, 'tangible' demo of how data that isn't human-readable remains learnable for ML purposes. I've recently tried using Nilsimsa and ssdeep hashes to reduce the input dimensionality of natural text, with some loss in accuracy on similarity tasks.
A combination of byte-pair encoding and locality-sensitive hashing might prove a more stable combination. Throw a gzip pass at that and we may be able to reduce input corpus file sizes immensely.
Maybe this was the inspiration for your project but I just saw a paper that applied compressors as part of a text classification system getting accuracy competitive with BERT. 14 lines of Python. https://aclanthology.org/2023.findings-acl.426/
I'm looking forward to Progressive ChatGPT, which immediately answers your question with a thoughtfully chosen emoji, that expands into an irresistible clickbait title, that expands into a viral tweet, that expands into a fascinating forum comment, that expands into a ranting and raving blog post, that expands into a ponderous Wikipedia page, that expands into a voluminous e-book, that expands into ...
Thank you for posting, I enjoyed the article. In true cybernetics fashion, the maximum compression ratio of information seems to be the current grail of our machine learning efforts, whether that is through inferring heuristics or gradually incorporating knowledge about the world.
I'm interested to see where various organisms sit on the information-density gradient in their ability to do so. DNA seems awfully wordy, but it has proven to be a very dense storage medium.
There is also an approach that uses an LLM and only compresses the LLM's errors relative to the target text. You'd have to add the LLM's size to the final size, but you could have it already downloaded at the receiving end, in which case I expect nothing would beat it for text compression. Of course, it would be very slow.
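One simple (and clearly suboptimal, compared to arithmetic coding on the model's probabilities) way to make that idea concrete: replace each symbol by its rank under the model's prediction and compress the rank stream, which is nearly all zeros when the model is good. The sketch below is mine, not from the post, and uses a static unigram byte model as a stand-in for the LLM so it runs on its own:

    import gzip
    from collections import Counter

    def rank_code(text: bytes, train: bytes) -> bytes:
        # Stand-in "model": rank bytes by frequency in some training text.
        counts = Counter(train)
        order = sorted(range(256), key=lambda b: -counts[b])
        rank_of = {b: i for i, b in enumerate(order)}
        # A good predictor puts most ranks at or near 0, which gzip handles well.
        ranks = bytes(rank_of[b] for b in text)
        return gzip.compress(ranks)  # decoding needs the same model to invert the ranks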
It's not a dumb question -- in theory, optimal coding assigns shorter codes to more frequently occurring tokens, so by design longer codes imply lower probabilities: that's how compression is achieved.
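For concreteness, here is a rough sketch (mine, not necessarily how the repo does it) of turning compressed lengths into a next-byte distribution; bz2, the alphabet argument, and the temperature parameter are illustrative choices:

    import bz2

    def next_byte_probs(prefix: bytes, alphabet: bytes, temperature: float = 1.0):
        # An optimal code spends about -log2 p(x) bits on symbol x, so p(x) ~ 2^(-bits).
        lengths = {x: len(bz2.compress(prefix + bytes([x]))) for x in alphabet}
        shortest = min(lengths.values())
        # Subtract the shortest length for numerical stability; the 8 converts bytes to bits.
        weights = {x: 2.0 ** (-8 * (L - shortest) / temperature) for x, L in lengths.items()}
        total = sum(weights.values())
        return {x: w / total for x, w in weights.items()}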
Python 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
I suggest you get all the unique characters from the Moby Dick text and use that as the alphabet if you're generating from it, or truncate the text to just your hand-selected characters; otherwise gzip cannot even approximate an optimal code, because the alphabets don't match.
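Something like this, assuming the corpus sits in a local "mobydick.txt" (file name illustrative):

    # Build the sampling alphabet from the text itself, so gzip never has to
    # code characters that don't occur in its "training" data.
    with open("mobydick.txt", encoding="utf-8") as f:
        corpus = f.read()
    alphabet = "".join(sorted(set(corpus)))
    print(len(alphabet), repr(alphabet))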
I think this would work better if you used a fixed compression dictionary trained over the corpus and used it to compress the prompt. Unless you already do that? I wouldn't expect this to work well for such a small training set. But I do wonder whether this is an excellent way to improve token resolution, especially for poorly represented languages such as those of Southeast Asia.
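For what it's worth, zlib supports exactly this kind of preset dictionary; a minimal sketch of the idea (not how the project works today):

    import zlib

    def compress_with_preset_dict(prompt: bytes, zdict: bytes) -> bytes:
        # zdict would hold frequent substrings from the training corpus
        # (only its last 32 KB is actually usable by DEFLATE).
        c = zlib.compressobj(level=9, zdict=zdict)
        return c.compress(prompt) + c.flush()

    # Decompression must be primed with the same dictionary:
    #   d = zlib.decompressobj(zdict=zdict)
    #   original = d.decompress(compressed)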
Sure! The current best achieved compression for English text (well, specifically, Wikipedia articles) is Fabrice Bellard's nncp, which uses... transformers. http://mattmahoney.net/dc/text.html#1085
Unfortunately, I don't think anyone has posted a trained model after running inputs through it, which could be used to repeat the experiment. I would try it, but I don't have an AVX2 CPU.
However, most research into reducing model perplexity on text (i.e. increasing the compression ratio when the model is used for encoding) uses fixed language models that are learned offline. Obviously, they use transformers too. You can see some of them here:
However (token) perplexity is only comparable across models if the same tokenizer is used. The above page seems to compare papers using the GPT-2 tokenizer. I would assume that if you controlled for differences in tokenizers by just measuring bits-per-word/byte, GPT-4 without RLHF would achieve the highest compression. It should be obvious that RLHF increases perplexity because it changes the training objective to something other than predicting text.
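The conversion is straightforward if the token count and raw byte count are reported; a sketch of the arithmetic (the function and argument names are mine):

    import math

    def bits_per_byte(token_perplexity: float, n_tokens: int, n_bytes: int) -> float:
        # Total cost of the text is n_tokens * log2(perplexity) bits; divide by raw bytes.
        return n_tokens * math.log2(token_perplexity) / n_bytes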
If I understand it correctly, this implementation re-compresses the prefix 256 times, once for each possible byte continuation. It should be possible to save a lot of work by saving the state of the compressor just before the predicted byte, and reusing it.
Also, since gzip works on a limited sliding context window, it's probably pointless to give it a prefix much longer than the window... unless there's something about the algorithm that I'm missing. Same with bzip2 and its block size.
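For the state-reuse idea, Python's zlib (the same DEFLATE used by gzip) exposes compressobj.copy(), and the prefix can be truncated to the 32 KB window up front; a rough sketch, not the project's actual code:

    import zlib

    WINDOW = 32 * 1024  # DEFLATE's sliding window; older bytes can't be referenced anyway

    def continuation_lengths(prefix: bytes) -> dict:
        base = zlib.compressobj(9)
        base.compress(prefix[-WINDOW:])  # feed the (truncated) prefix exactly once
        lengths = {}
        for b in range(256):
            c = base.copy()  # reuse the prefix state instead of recompressing it
            # The flush also emits buffered prefix data, but that cost is identical
            # for all 256 candidates, so relative comparisons are unaffected.
            lengths[b] = len(c.compress(bytes([b])) + c.flush())
        return lengths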
Possibly. It would definitely make it even slower though. To sample from the model, if you have a prompt/training data c, you have to compute the compressed length of all strings cx for tokens x in your vocabulary. If you have a long prompt c (like 1 million lines of wikipedia) and a big vocabulary of tokens (like 100,000 different words), you're going to have a bad time.
It should be noted that Moby Dick is (in)famous for its very large and unusual vocabulary. Some lighter reading might yield better results for demonstration purposes. Also, while fairly long for a single novel, the number of words is still minuscule compared to what other LMs are trained on. Using the entire Gutenberg library, or a Wikipedia dump, could improve the quality dramatically.
Of course, it doesn't really matter, as the whole thing is obviously just a toy, but I still think that this approach should be able to produce much better output than the garbled Moby Dick example.
"An early version of a new project will sometimes be dismissed as a toy. It's a good sign when people do this. That means it has everything a new idea needs except scale, and that tends to follow." [1]
This is a horribly disingenuous and dismissive response. Sometimes good ideas fundamentally just don't work, they miss foundational issues that make them incompatible with reality. That just happens sometimes. Not everything can be polished up and made into something useful.
Cool! I had been thinking about trying this as well, after reading about the idea in one of Cosma Shalizi's notebooks [0]. I'd love to see how something like this performs when "trained" on a corpus the size of the web when given the same kind of computational resources used to train modern LLMs.
This is a nice application of the idea that understanding and compression are equivalent, which is the inspiration behind the Hutter Prize (500 K€ for whoever can compress Wikipedia best, see http://prize.hutter1.net/ - the current state of the art is about 15 MB).
Compression can also be used as a "machine learning method": put all data vectors with the same class label into a separate bucket; any new, unseen and unlabeled data item can then be added to each bucket, and the most likely class is the one whose bucket grows the least when compressed after adding it. The University of Waikato group (Ian Witten and co-workers) did a fair amount of that kind of work, perhaps first, e.g. Frank, Eibe, Chang Chui and Ian H. Witten (2000), "Text categorization using compression models", https://www.cs.waikato.ac.nz/~eibe/pubs/Frank_categorization... - yes, published 23 years ago!
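A minimal sketch of that bucket scheme with gzip (the labels and the newline separator are illustrative):

    import gzip

    def classify(text: str, buckets: dict) -> str:
        # buckets maps class label -> concatenated training texts for that class.
        def clen(s: str) -> int:
            return len(gzip.compress(s.encode("utf-8")))
        # Predicted class: the bucket whose compressed size grows the least.
        return min(buckets, key=lambda label: clen(buckets[label] + "\n" + text) - clen(buckets[label]))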
"Less is More: Parameter-Free Text Classification with Gzip
"...We propose a non-parametric alternative to DNNs that's easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distributed datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy."
Your codes are K-ary, but it doesn't look like that's taken into account, going by the README. What is the log(256) conversion factor for? What is 1/temperature for?
idbfs has posted this link already, but did not explain that Shalizi provides a deep theoretical explanation of why universal source coding (which requires no information about symbol distributions or statistics), such as Lempel-Ziv-derived compression algorithms, can serve as a powerful language model if practical restrictions such as dictionary and input corpus size are lifted.
This is a good example of how far old methods could be pushed if similar resources were devoted to them. Who knows, they might even possess advantages that have so far gone unseen for lack of exploration at larger scales.
That said, Transformers have a number of practical advantages. The learned projection matrices in attention lend Transformers a dynamic adaptability with respect to learned patterns, making them programmable by their context and able to work out patterns present in context zero-shot, on the fly; gzip-based language models are limited to their dictionary of patterns. The underlying vector space of neural language models also makes semantics more readily learnable (driving novel synthesis such as neologisms and more), while feed-forward layers can learn a large range of computations.
The author implies this is a useless toy, but the formula is essentially Solomonoff's universal prior. That makes it a bit more than a useless toy. Great idea!
This could be improved by changing the alphabet and the corresponding codes for the text. If the alphabet consisted of digrams (two-character units), the output would be much more coherent.
What would happen if each word were "tokenized" to an integer, and you then generated tokens instead of characters, producing a string of coherent words instead of random strings? Maybe the answer is obvious, but not to me without diving into it at a deeper level. I'd be interested to hear anyone's thoughts on this.
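Roughly, it would mean building a word vocabulary and modelling the id sequence; every sampled token would then be a real word, at the cost of a much bigger "alphabet" (and, as noted elsewhere in the thread, much slower sampling). A toy sketch of the tokenization step, with an illustrative word regex:

    import re

    def word_tokenize(text: str):
        words = re.findall(r"[a-z']+", text.lower())
        vocab = {w: i for i, w in enumerate(dict.fromkeys(words))}  # first-seen order
        ids = [vocab[w] for w in words]
        return ids, vocab  # model the id sequence; invert vocab to map ids back to words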
I was curious as to whether this would work. Good to see people trying new things. The next step would be to feed it PGP- or AES-256-encrypted data and hand it a symmetric key -- sort of an attempt to secure what lives in the model itself.
Basically "Information theoretically" compression is a measure of informational distance. So basically if text A and Text B contactenated together, compress better than text A and text C it means A and B repeat more patterns. and are closer. All we need are some distance functions .. in a way you can think about it like Levenhstein distance but that can take into account inputs with very different sizes, repetitions, changes in order big inserts etc...