
It's probably operating on UTF-8 data byte by byte without any additional processing: just feeding it the raw string data and letting it assign the tokens.

It's similar to how it splits words at arbitrary points rather than at clear morphological or lexical boundaries (e.g. on the Jane Austen text `"Now, ma'am," said Jane to her aunt, "shall we join Mrs. Elton?"` I've seen it tokenize that as `"|Now|,| ma|'|am|,"| said| Jane| to| her| aunt|,| "|shall| we| join| Mrs|.| El|ton|?"`).
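For illustration, here's roughly what byte-level input looks like. This is just standard Python; whether the model in question actually preprocesses this way is my assumption:

  # A byte-level model sees raw UTF-8 bytes, not glyphs. ASCII characters
  # map one-to-one to bytes, but 'é' becomes the two-byte sequence 0xC3 0xA9.
  text = "ma'am at the café"
  raw = text.encode("utf-8")
  print(list(raw))
  # [109, 97, 39, 97, 109, 32, 97, 116, 32, 116, 104, 101, 32, 99, 97, 102, 195, 169]

Nothing in that byte stream marks where one glyph ends and the next begins; the model has to learn that statistically.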




I would find that hard to believe: the bytes have zero semantic meaning, and, moreover, pairing the wrong bytes in the output would result in complete gibberish. It would be akin to tokenizing each English letter, as in "N|o|w|,| |m|a|'|a|m|...", except far worse.

Moreover, it's trivially easy to tokenize the glyphs.
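A quick sketch of how trivial (plain Python; its strings already iterate by code point, which is close enough to glyphs for most text):

  text = "Now, ma'am"
  print("|".join(text))  # N|o|w|,| |m|a|'|a|m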


A character is the base unit of written communication. Using single characters as tokens isn't a bad idea; it just requires too many resources to train and run inference.

BPE is a tradeoff between single letters (computationally hard) and a word dictionary (can't handle novel words, new languages, or complex structures like code syntax). Note that the token set must be hardcoded because the neural network has an output layer consisting of neurons mapped one-to-one to the tokens (and the predicted token is the most activated neuron).
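To make the tradeoff concrete, here's a toy sketch of the BPE training loop. The corpus and merge count are invented for illustration; real tokenizers learn tens of thousands of merges over huge corpora:

  from collections import Counter

  def most_frequent_pair(corpus):
      # Count adjacent symbol pairs across all words, weighted by frequency.
      pairs = Counter()
      for word, freq in corpus.items():
          symbols = word.split()
          for a, b in zip(symbols, symbols[1:]):
              pairs[(a, b)] += freq
      return max(pairs, key=pairs.get) if pairs else None

  def merge_pair(corpus, pair):
      # Rewrite every word, fusing each occurrence of the pair into one symbol.
      new_corpus = {}
      for word, freq in corpus.items():
          symbols, merged, i = word.split(), [], 0
          while i < len(symbols):
              if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                  merged.append(symbols[i] + symbols[i + 1])
                  i += 2
              else:
                  merged.append(symbols[i])
                  i += 1
          new_corpus[" ".join(merged)] = freq
      return new_corpus

  # Toy corpus: words pre-split into characters, with counts.
  corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
  for step in range(5):
      pair = most_frequent_pair(corpus)
      corpus = merge_pair(corpus, pair)
      print("merge", step + 1, ":", pair)
  # Frequent substrings ("es", "est", "low") become single tokens, while
  # rare words still decompose into smaller pieces.

The learned merge table is then frozen, which is exactly why the token set has to be hardcoded into the output layer.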

Human brains roughly do the same thing - that's why we have syllables as a tradeoff between letters and words.


> A character is the base unit of written communication

Yes, I guess the point here is that the glyph, not the byte, is the base unit of communication in Unicode charsets.
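And even the code point isn't quite the glyph. A quick Python illustration, with a hand-picked example:

  # One user-perceived glyph can span several code points, and each code
  # point can span several UTF-8 bytes: 'e' plus a combining acute accent.
  glyph = "e\u0301"  # renders as a single glyph: é
  print(len(glyph))                  # 2 code points
  print(len(glyph.encode("utf-8")))  # 3 bytes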



