
It's probably operating on UTF-8 data byte by byte without any additional processing: just feeding it the raw string data and letting it assign the tokens.

It's similar to how it splits words at arbitrary points rather than at clear morphological or lexical boundaries (e.g. on the Jane Austen text `"Now, ma'am," said Jane to her aunt, "shall we join Mrs. Elton?"` I've seen it tokenize that as `"|Now|,| ma|'|am|,"| said| Jane| to| her| aunt|,| "|shall| we| join| Mrs|.| El|ton|?"`).
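For illustration, here's roughly what byte-level input looks like. This is just standard Python; whether the model in question actually preprocesses this way is my assumption:

  # A byte-level model sees raw UTF-8 bytes, not glyphs. ASCII characters
  # map one-to-one to bytes, but 'é' becomes the two-byte sequence 0xC3 0xA9.
  text = "ma'am at the café"
  raw = text.encode("utf-8")
  print(list(raw))
  # [109, 97, 39, 97, 109, 32, 97, 116, 32, 116, 104, 101, 32, 99, 97, 102, 195, 169]

Nothing in that byte stream marks where one glyph ends and the next begins; the model has to learn that statistically.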




I would find that hard to believe: the bytes have zero semantic meaning, and, moreover, pairing the wrong bytes in the output would result in complete gibberish. It would be akin to tokenizing each English letter, as in "N|o|w|,| |m|a|'|a|m|...", except far worse.

Moreover, it's trivially easy to tokenize the glyphs.
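A quick sketch of how trivial (plain Python; its strings already iterate by code point, which is close enough to glyphs for most text):

  text = "Now, ma'am"
  print("|".join(text))  # N|o|w|,| |m|a|'|a|m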


A character is the base unit of written communication. Using single characters as tokens isn't a bad idea; it just requires too many resources to train and run inference.

BPE is a tradeoff between single letters (computationally hard) and a word dictionary (can't handle novel words, new languages, or complex structures like code syntax). Note that the token set must be hardcoded because the neural network has an output layer consisting of neurons mapped one-to-one to the tokens (and the predicted token is the most activated neuron).
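To make the tradeoff concrete, here's a toy sketch of the BPE training loop. The corpus and merge count are invented for illustration; real tokenizers learn tens of thousands of merges over huge corpora:

  from collections import Counter

  def most_frequent_pair(corpus):
      # Count adjacent symbol pairs across all words, weighted by frequency.
      pairs = Counter()
      for word, freq in corpus.items():
          symbols = word.split()
          for a, b in zip(symbols, symbols[1:]):
              pairs[(a, b)] += freq
      return max(pairs, key=pairs.get) if pairs else None

  def merge_pair(corpus, pair):
      # Rewrite every word, fusing each occurrence of the pair into one symbol.
      new_corpus = {}
      for word, freq in corpus.items():
          symbols, merged, i = word.split(), [], 0
          while i < len(symbols):
              if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                  merged.append(symbols[i] + symbols[i + 1])
                  i += 2
              else:
                  merged.append(symbols[i])
                  i += 1
          new_corpus[" ".join(merged)] = freq
      return new_corpus

  # Toy corpus: words pre-split into characters, with counts.
  corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
  for step in range(5):
      pair = most_frequent_pair(corpus)
      corpus = merge_pair(corpus, pair)
      print("merge", step + 1, ":", pair)
  # Frequent substrings ("es", "est", "low") become single tokens, while
  # rare words still decompose into smaller pieces.

The learned merge table is then frozen, which is exactly why the token set has to be hardcoded into the output layer.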

Human brains roughly do the same thing - that's why we have syllables as a tradeoff between letters and words.


> A character is the base unit of written communication

Yes, I guess the point here is that the glyph, not the byte, is the base unit of communication in Unicode charsets.
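And even the code point isn't quite the glyph. A quick Python illustration, with a hand-picked example:

  # One user-perceived glyph can span several code points, and each code
  # point can span several UTF-8 bytes: 'e' plus a combining acute accent.
  glyph = "e\u0301"  # renders as a single glyph: é
  print(len(glyph))                  # 2 code points
  print(len(glyph.encode("utf-8")))  # 3 bytes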



