
ChatGPT models syntax, not semantics

There's no "better way" to do it, because the tokens are all meaningless to ChatGPT; it only cares about how efficiently they can be parsed and processed.

The competing pressures are to cover all possible text with the largest chunks possible (so each string needs the fewest tokens possible) while keeping the vocabulary bounded. The split points aren't arbitrary: text is broken into the largest available chunks drawn from a fixed set of the most common character sequences.

Common words like "the", "fast", "unity", and "flying" are all single tokens, but not because they're words: they're common letter clusters, no different in kind from "fl", "ing", "un", or "ple".

"gadflying" is tokenized into [g, ad, flying], even though it's only loosely semantically related to "flying", it's just the most efficient way to tokenize it.



