Hacker News new | past | comments | ask | show | jobs | submit login

A character is the base unit of written communication. Single characters as tokens is not a bad idea, it just requires too much resources to make it learn and infer.

BPE is a tradeoff between single letters (computationally hard) and a word dictionary (can't handle novel words, languages or complex structures like code syntax). Note that tokens must be hardcoded because the neural network has an output layer consisting of neurons one-to-one mapped to the tokens (and the predicted word is the most activated neuron).

Human brains roughly do the same thing - that's why we have syllables as a tradeoff between letters and words.




> A character is the base unit of written communication

Yes, I guess the point here is that the glyph, not the byte, is the base unit of communication in Unicode charsets.




Consider applying for YC's Summer 2025 batch! Applications are open till May 13

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: