Accidentally quadratic ! Byte pair encoding by construction is quadratic on the ...

ladberg · on April 5, 2023

For GPT they didn't pre-split into words so they definitely have something faster than quadratic! Not sure what it is though, I'm very curious.

bobm_kite9 · on April 6, 2023

Yes, you're right. There could be multiple ways to tokenise a sentence. Shouldn't all the valid tokens be included in the vector?

GistNoesis · on April 6, 2023

If this splitting interest you having a look at https://github.com/openai/tiktoken/blob/main/src/lib.rs is great at showing all the ugly edge cases that causes instabilities.

In theory Byte Pair Encoding is unique, but practice makes it harder. It's also complicated due to regex and utf-8. Most of the time the differences should be too important because the neural network should be able to handle typos.

In BPE you may have plenty of escaping problems, problematic character like ' and \ are nasty to get right : worst case if you don't handle your errors being that if you have trained your byte pair encoding dictionary on escaped sentences, then a single \ should never occur as it is encoded as \\, so if you split the string between the \ then the byte pair encoding might fail to find the key in the dictionary.

Making the thing deterministic and stable when you change your regex version (and when you train one network you'd like to not have to retrain it when there is a bugfix in a regex library). Porting to other platforms also becomes very hard if you want replicable results.