Reading the sentencepiece paper, I see they say:

> The main difference to other compression algorithms, such as Huffman encoding, which have been proposed to produce a variable-length encoding of words for NMT (Chitnis and DeNero, 2015), is that our symbol sequences are still interpretable as subword units, and that the network can generalize to translate and produce new words (unseen at training time) on the basis of these subword units.

I don't see why Huffman encoding doesn't give you that same interpretability?

Actually, the algorithm for producing a Huffman tree is very similar to that for BPE:

> The process begins with the leaf nodes containing the probabilities of the symbol they represent. Then, the process takes the two nodes with smallest probability, and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weight of the children. We then apply the process again, on the new internal node and on the remaining nodes (i.e., we exclude the two leaf nodes), we repeat this process until only one node remains, which is the root

(from https://en.m.wikipedia.org/wiki/Huffman_coding)
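For concreteness, here's a minimal Python sketch of that quoted procedure (the symbols and frequencies are made up for illustration):

    import heapq
    import itertools

    freqs = {"e": 12, "t": 9, "a": 8, "o": 7, "z": 1}

    # Each heap entry: (weight, tiebreaker, node). A node is either a
    # leaf symbol (str) or an internal (left, right) pair (tuple); the
    # tiebreaker keeps equal-weight entries from comparing nodes.
    counter = itertools.count()
    heap = [(w, next(counter), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)

    # Repeatedly merge the two lowest-weight nodes until one root remains.
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(counter), (n1, n2)))

    root = heap[0][2]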

I guess the issue is that Huffman requires the alphabet to be predefined, whereas BPE "discovers" it as it goes along.
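Something like this toy training loop (corpus and merge count invented for illustration): start from characters, repeatedly count adjacent pairs, and fold the most frequent pair into a new symbol:

    from collections import Counter

    corpus = ["low", "lower", "lowest", "newest", "widest"]
    words = [list(w) for w in corpus]  # start from single characters

    for _ in range(5):  # number of merges is a hyperparameter
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b  # the newly "discovered" symbol
        # Apply the merge everywhere it occurs.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [merged]
                else:
                    i += 1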




> I don't see why Huffman encoding doesn't give you that same interpretability?

It might just be that a Huffman encoding is a bit-string and not a byte-string.
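A quick sketch of what that means, using a hardcoded toy tree (leaves are symbols, internal nodes are (left, right) pairs): each symbol maps to a variable-length bit-string, so codewords don't fall on byte boundaries the way BPE tokens do:

    root = (("a", ("z", "o")), ("e", "t"))

    def codes(node, prefix=""):
        if isinstance(node, str):  # leaf: emit its bit-string
            return {node: prefix}
        left, right = node
        return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

    print(codes(root))
    # {'a': '00', 'z': '010', 'o': '011', 'e': '10', 't': '11'}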

BPE encoding causes interesting failures, like how it can't do anagrams or spell words backwards properly. And yet it can make rhyming poems now.
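A toy illustration of the backwards-spelling problem (the token split here is made up): reversing the token sequence is not the same as reversing the characters the model would need to emit:

    tokens = ["straw", "berry"]       # hypothetical BPE split of "strawberry"
    print("".join(tokens)[::-1])      # 'yrrebwarts'  <- true character reversal
    print("".join(reversed(tokens)))  # 'berrystraw'  <- token-level reversal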


> BPE encoding causes interesting failures, like how it can't do anagrams or spell words backwards properly. And yet it can make rhyming poems now.

I don't think BPE encoding makes anagrams impossible. Just harder.



