It should be obvious that this is junk quality as a language model. But it's a cool example of the equivalence between compression codes and probability distributions, so I hope people find it interesting for that reason. You can also get a bit of an intuitive sense for the patterns that gzip and bzip2 pick up on in text (they like repetitive strings).
I think it is a cool, 'tangible' demo of how data that isn't human-readable remains learnable for ML purposes. I've recently tried using Nilsimsa and ssdeep hashes to reduce the input dimensionality of natural text, with some loss in accuracy on similarity tasks.
A combination of byte-pair encoding and locality-sensitive hashing might prove a more stable combination. Throw a gzip pass at that and we may be able to reduce input corpus file sizes immensely.
Maybe this was the inspiration for your project but I just saw a paper that applied compressors as part of a text classification system getting accuracy competitive with BERT. 14 lines of Python. https://aclanthology.org/2023.findings-acl.426/
I'm looking forward to Progressive ChatGPT, which immediately answers your question with a thoughtfully chosen emoji, that expands into an irresistible clickbait title, that expands into a viral tweet, that expands into a fascinating forum comment, that expands into a ranting and raving blog post, that expands into a ponderous Wikipedia page, that expands into a voluminous e-book, that expands into ...
Thank you for posting, I enjoyed the article. In true cybernetics fashion, the maximum compression ratio of information seems to be the current grail of our machine learning efforts, whether that is through inferring heuristics or gradually incorporating knowledge about the world.
I'm interested to see where various organisms sit on the information-density gradient in their ability to do so. DNA seems awfully wordy, but it has proven to be a very dense storage medium.
There is also an approach that uses an LLM and only compresses the LLM's errors relative to the target text. You'd have to add the LLM's size to the final size, but you could have it already downloaded at the receiving end, in which case I expect nothing would beat it for text compression. Of course, it would be very slow.
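One simple (and clearly suboptimal, compared to arithmetic coding on the model's probabilities) way to make that idea concrete: replace each symbol by its rank under the model's prediction and compress the rank stream, which is nearly all zeros when the model is good. The sketch below is mine, not from the post, and uses a static unigram byte model as a stand-in for the LLM so it runs on its own:

    import gzip
    from collections import Counter

    def rank_code(text: bytes, train: bytes) -> bytes:
        # Stand-in "model": rank bytes by frequency in some training text.
        counts = Counter(train)
        order = sorted(range(256), key=lambda b: -counts[b])
        rank_of = {b: i for i, b in enumerate(order)}
        # A good predictor puts most ranks at or near 0, which gzip handles well.
        ranks = bytes(rank_of[b] for b in text)
        return gzip.compress(ranks)  # decoding needs the same model to invert the ranks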
It's not a dumb question -- in theory, optimal coding assigns shorter codes to more frequently occurring tokens, so by design longer codes imply lower probabilities: that's how compression is achieved.
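For concreteness, here is a rough sketch (mine, not necessarily how the repo does it) of turning compressed lengths into a next-byte distribution; bz2, the alphabet argument, and the temperature parameter are illustrative choices:

    import bz2

    def next_byte_probs(prefix: bytes, alphabet: bytes, temperature: float = 1.0):
        # An optimal code spends about -log2 p(x) bits on symbol x, so p(x) ~ 2^(-bits).
        lengths = {x: len(bz2.compress(prefix + bytes([x]))) for x in alphabet}
        shortest = min(lengths.values())
        # Subtract the shortest length for numerical stability; the 8 converts bytes to bits.
        weights = {x: 2.0 ** (-8 * (L - shortest) / temperature) for x, L in lengths.items()}
        total = sum(weights.values())
        return {x: w / total for x, w in weights.items()}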
Python 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
I suggest you get all the unique characters from the Moby Dick text and use that as the alphabet if you're generating from it, or truncate the text to just your hand-selected characters; otherwise gzip cannot even approximate an optimal code, because the alphabets don't match.
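Something like this, assuming the corpus sits in a local "mobydick.txt" (file name illustrative):

    # Build the sampling alphabet from the text itself, so gzip never has to
    # code characters that don't occur in its "training" data.
    with open("mobydick.txt", encoding="utf-8") as f:
        corpus = f.read()
    alphabet = "".join(sorted(set(corpus)))
    print(len(alphabet), repr(alphabet))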
I think this would work better if you used a fixed compression dictionary trained over the corpus and used it to compress the prompt. Unless you already do that? I wouldn't expect this to work well for such a small training set. But I do wonder whether this is an excellent way to improve token resolution, especially for poorly represented languages such as those of Southeast Asia.
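For what it's worth, zlib supports exactly this kind of preset dictionary; a minimal sketch of the idea (not how the project works today):

    import zlib

    def compress_with_preset_dict(prompt: bytes, zdict: bytes) -> bytes:
        # zdict would hold frequent substrings from the training corpus
        # (only its last 32 KB is actually usable by DEFLATE).
        c = zlib.compressobj(level=9, zdict=zdict)
        return c.compress(prompt) + c.flush()

    # Decompression must be primed with the same dictionary:
    #   d = zlib.decompressobj(zdict=zdict)
    #   original = d.decompress(compressed)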
Sure! The current best achieved compression for English text (well, specifically, Wikipedia articles) is Fabrice Bellard's nncp, which uses... transformers. http://mattmahoney.net/dc/text.html#1085
Unfortunately, I don't think anyone has posted a trained model after running inputs through it, which could be used to repeat the experiment. I would try it, but I don't have an AVX2 CPU.
However, most research into reducing model perplexity on text (i.e. increasing the compression ratio when the model is used for encoding) uses fixed language models that are learned offline. Obviously, they use transformers too. You can see some of them here:
However (token) perplexity is only comparable across models if the same tokenizer is used. The above page seems to compare papers using the GPT-2 tokenizer. I would assume that if you controlled for differences in tokenizers by just measuring bits-per-word/byte, GPT-4 without RLHF would achieve the highest compression. It should be obvious that RLHF increases perplexity because it changes the training objective to something other than predicting text.
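The conversion is straightforward if the token count and raw byte count are reported; a sketch of the arithmetic (the function and argument names are mine):

    import math

    def bits_per_byte(token_perplexity: float, n_tokens: int, n_bytes: int) -> float:
        # Total cost of the text is n_tokens * log2(perplexity) bits; divide by raw bytes.
        return n_tokens * math.log2(token_perplexity) / n_bytes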
If I understand it correctly, this implementation re-compresses the prefix 256 times, once for each possible byte continuation. It should be possible to save a lot of work by saving the state of the compressor just before the predicted byte, and reusing it.
Also, since gzip works on a limited sliding context window, it's probably pointless to give it a prefix much longer than the window... unless there's something about the algorithm that I'm missing. Same with bzip2 and its block size.
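For the state-reuse idea, Python's zlib (the same DEFLATE used by gzip) exposes compressobj.copy(), and the prefix can be truncated to the 32 KB window up front; a rough sketch, not the project's actual code:

    import zlib

    WINDOW = 32 * 1024  # DEFLATE's sliding window; older bytes can't be referenced anyway

    def continuation_lengths(prefix: bytes) -> dict:
        base = zlib.compressobj(9)
        base.compress(prefix[-WINDOW:])  # feed the (truncated) prefix exactly once
        lengths = {}
        for b in range(256):
            c = base.copy()  # reuse the prefix state instead of recompressing it
            # The flush also emits buffered prefix data, but that cost is identical
            # for all 256 candidates, so relative comparisons are unaffected.
            lengths[b] = len(c.compress(bytes([b])) + c.flush())
        return lengths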
Possibly. It would definitely make it even slower though. To sample from the model, if you have a prompt/training data c, you have to compute the compressed length of all strings cx for tokens x in your vocabulary. If you have a long prompt c (like 1 million lines of wikipedia) and a big vocabulary of tokens (like 100,000 different words), you're going to have a bad time.
It should be noted that Moby Dick is (in)famous for its very large and unusual vocabulary. Some lighter reading might yield better results for demonstration purposes. Also, while fairly long for a single novel, the number of words is still minuscule compared to what other LMs are trained on. Using the entire Gutenberg library, or a Wikipedia dump, could improve the quality dramatically.
Of course, it doesn't really matter, as the whole thing is obviously just a toy, but I still think that this approach should be able to produce much better output than the garbled Moby Dick example.
"An early version of a new project will sometimes be dismissed as a toy. It's a good sign when people do this. That means it has everything a new idea needs except scale, and that tends to follow." [1]
This is a horribly disingenuous and dismissive response. Sometimes good ideas fundamentally just don't work, they miss foundational issues that make them incompatible with reality. That just happens sometimes. Not everything can be polished up and made into something useful.
Cool! I had been thinking about trying this as well, after reading about the idea in one of Cosma Shalizi's notebooks [0]. I'd love to see how something like this performs when "trained" on a corpus the size of the web when given the same kind of computational resources used to train modern LLMs.
This is a nice application of the idea that understanding and compression are equivalent, which is the inspiration behind the Hutter Prize (500 K€ for whoever can compress Wikipedia best, see http://prize.hutter1.net/ - the current state of the art is about 15 MB).
Compression can also be used as a "machine learning method": put all data vectors with the same class label into a separate bucket; any new, unseen and unlabeled data item can then be added to each bucket, and the most likely class is the one whose bucket grows the least when compressed after adding it. The University of Waikato group (Ian Witten and co-workers) did a fair amount of that kind of work, perhaps first, e.g. Frank, Eibe, Chang Chui and Ian H. Witten (2000), "Text categorization using compression models", https://www.cs.waikato.ac.nz/~eibe/pubs/Frank_categorization... - yes, published 23 years ago!
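A minimal sketch of that bucket scheme with gzip (the labels and the newline separator are illustrative):

    import gzip

    def classify(text: str, buckets: dict) -> str:
        # buckets maps class label -> concatenated training texts for that class.
        def clen(s: str) -> int:
            return len(gzip.compress(s.encode("utf-8")))
        # Predicted class: the bucket whose compressed size grows the least.
        return min(buckets, key=lambda label: clen(buckets[label] + "\n" + text) - clen(buckets[label]))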
"Less is More: Parameter-Free Text Classification with Gzip
"...We propose a non-parametric alternative to DNNs that's easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distributed datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy."
Your codes are K-ary, but it doesn't look like that's taken into account, going by the README. What is the log(256) conversion factor for? What is 1/temperature for?
idbfs has posted this link already, but did not explain that Shalizi provides a deep theoretical explanation of why universal source coding (which requires no information about symbol distributions or statistics), such as Lempel-Ziv-derived compression algorithms, can serve as a powerful language model if practical restrictions such as dictionary and input corpus size are lifted.
This is a good example of how far old methods could be pushed if similar resources were devoted to them. Who knows, they might even possess advantages that have so far gone unseen for lack of exploration at larger scales.
That said, Transformers have a number of practical advantages. The learned projection matrices in attention lend Transformers a dynamic adaptability with respect to learned patterns, making them programmable by their context and able to work out patterns present in context zero-shot, on the fly; gzip-based language models are limited to their dictionary of patterns. The underlying vector space of neural language models also makes semantics more readily learnable (driving novel synthesis such as neologisms and more), while feed-forward layers can learn a large range of computations.
The author implies this is a useless toy, but the formula is essentially Solomonoff's universal prior. That makes it a bit more than a useless toy. Great idea!
This could be improved by changing the alphabet and the corresponding codes for the text. If the alphabet consisted of digrams (two-character units), the output would be much more coherent.
What would happen if each word were "tokenized" to an integer, and you then generated tokens instead of characters, producing a string of coherent words instead of random strings? Maybe the answer is obvious, but not to me without diving into it at a deeper level. I'd be interested to hear anyone's thoughts on this.
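Roughly, it would mean building a word vocabulary and modelling the id sequence; every sampled token would then be a real word, at the cost of a much bigger "alphabet" (and, as noted elsewhere in the thread, much slower sampling). A toy sketch of the tokenization step, with an illustrative word regex:

    import re

    def word_tokenize(text: str):
        words = re.findall(r"[a-z']+", text.lower())
        vocab = {w: i for i, w in enumerate(dict.fromkeys(words))}  # first-seen order
        ids = [vocab[w] for w in words]
        return ids, vocab  # model the id sequence; invert vocab to map ids back to words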
I was curious as to whether this would work. Good to see people trying new things. The next step would be to feed it PGP- or AES-256-encrypted data and hand it a symmetric key -- sort of an attempt to secure what lives in the model itself.
Basically "Information theoretically" compression is a measure of informational distance. So basically if text A and Text B contactenated together, compress better than text A and text C it means A and B repeat more patterns. and are closer. All we need are some distance functions .. in a way you can think about it like Levenhstein distance but that can take into account inputs with very different sizes, repetitions, changes in order big inserts etc...