> The main difference to other compression algorithms, such as Huffman encoding, which have been proposed to produce a variable-length encoding of words for NMT (Chitnis and DeNero, 2015), is that our symbol sequences are still interpretable as subword units, and that the network can generalize to translate and produce new words (unseen at training time) on the basis of these subword units.
I don't see why Huffman encoding wouldn't give you that same interpretability?
Actually, the algorithm for producing a Huffman tree is very similar to that for BPE:
> The process begins with the leaf nodes containing the probabilities of the symbol they represent. Then, the process takes the two nodes with smallest probability, and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weight of the children. We then apply the process again, on the new internal node and on the remaining nodes (i.e., we exclude the two leaf nodes), we repeat this process until only one node remains, which is the root
(from https://en.m.wikipedia.org/wiki/Huffman_coding)
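For concreteness, here's a minimal sketch of that procedure in Python (the function name `huffman_code` and the toy frequencies are mine; I'm using `heapq` to repeatedly pop the two smallest-weight nodes, as in the quoted description):

```python
import heapq
import itertools

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: frequency} dict."""
    counter = itertools.count()  # tie-breaker so heapq never compares trees
    heap = [(weight, next(counter), symbol) for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two nodes with the smallest weights...
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # ...and create a new internal node whose weight is their sum.
        heapq.heappush(heap, (w1 + w2, next(counter), (left, right)))
    # Walk the remaining root to read off a prefix-free bit string per symbol.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: an original symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_code({"a": 5, "b": 2, "c": 1, "d": 1}))
# e.g. {'b': '00', 'c': '010', 'd': '011', 'a': '1'}
```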
I guess the issue is that Huffman requires the alphabet (and its probabilities) to be fixed up front, whereas BPE "discovers" it as it goes along: every merge adds a brand-new symbol to the vocabulary.
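To make the contrast concrete, here's a similarly minimal sketch of BPE merge learning (again, the function name and toy word frequencies are mine; the `</w>` marker is one common way to encode the paper's end-of-word symbol). Note the inversion relative to Huffman: BPE merges the *most* frequent adjacent pair, and the merged pair becomes a new symbol available to later merges, whereas Huffman merges the two *least* frequent nodes and the symbol alphabet never grows.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict."""
    # Start from characters, with a word-end marker so merges
    # can't cross word boundaries.
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # the MOST frequent pair
        merges.append(best)
        merged = best[0] + best[1]         # a brand-new symbol is created
        # Replace every occurrence of the pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4))
```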