
Accidentally quadratic!

Byte pair encoding is, by construction, quadratic in the length of each word, and the input is usually pre-split into words before being handed to the byte pair encoder.
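To make the quadratic part concrete, here is a minimal sketch of the naive merge loop. It is not the library's code: the `ranks` table (adjacent pair -> merge priority) and the function name are made up for illustration. Each merge rescans every adjacent pair, and a word of n bytes can need up to n - 1 merges, hence O(n^2).

```python
def bpe_encode_naive(word: bytes, ranks: dict) -> list:
    """Greedy BPE on one word: merge the lowest-ranked adjacent pair until none is left.

    `ranks` is a hypothetical table mapping (left, right) byte pairs to a merge rank.
    """
    parts = [bytes([b]) for b in word]  # start from single bytes
    while len(parts) > 1:
        best_rank, best_i = None, None
        # Full scan over all adjacent pairs: O(n) work for every single merge.
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i  # strict '<' keeps the leftmost on ties
        if best_i is None:
            break  # no mergeable pair left
        # Replace the two parts with their concatenation (also O(n) on a list).
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts


toy_ranks = {(b"a", b"b"): 0, (b"ab", b"c"): 1}  # toy merge table, not a real vocabulary
print(bpe_encode_naive(b"abcab", toy_ranks))
```

With short pre-split words this is perfectly fine; it only hurts when a single "word" is thousands of bytes long.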

Hopefully they use a different implementation in prod. It needs to be guarded against very long words (like 10k-character words :) ).

In previous tokenizers like CLIP's (https://github.com/openai/CLIP/blob/main/clip/simple_tokeniz... ), they used additional preprocessing steps such as HTML unescaping and various text cleanup using several Python libraries (ftfy, html and regex), which made porting the code exactly to other languages a real pain.

Sadly this library doesn't solve that :'-(




For GPT they didn't pre-split into words so they definitely have something faster than quadratic! Not sure what it is though, I'm very curious.
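I don't know what they actually run either, but one well-known way to get below quadratic is to keep the candidate pairs in a priority queue over a linked list of parts, and discard stale heap entries lazily. A sketch under the same assumptions as a toy encoder (the `ranks` pair table and all names here are hypothetical); it does O(n log n) heap operations as long as token length is bounded:

```python
import heapq


def bpe_encode_heap(word: bytes, ranks: dict) -> list:
    """Greedy BPE using a min-heap of (rank, position) instead of rescanning."""
    n = len(word)
    if n == 0:
        return []
    parts = [bytes([b]) for b in word]
    prev = list(range(-1, n - 1))  # index of the live part to the left (-1 = none)
    nxt = list(range(1, n + 1))    # index of the live part to the right (n = none)
    alive = [True] * n
    heap = []

    def push(i: int) -> None:
        # Queue the pair starting at part i, if that pair is mergeable.
        j = nxt[i]
        if j < n:
            rank = ranks.get((parts[i], parts[j]))
            if rank is not None:
                heapq.heappush(heap, (rank, i, parts[i], parts[j]))

    for i in range(n - 1):
        push(i)

    while heap:
        rank, i, left, right = heapq.heappop(heap)
        # Lazy invalidation: skip entries whose parts changed since they were pushed.
        if not alive[i] or nxt[i] >= n:
            continue
        j = nxt[i]
        if parts[i] != left or parts[j] != right:
            continue
        # Merge part j into part i and unlink j.
        parts[i] = left + right
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] < n:
            prev[nxt[j]] = i
        # Only the two pairs touching the merged part are new.
        if prev[i] >= 0:
            push(prev[i])
        push(i)

    out, i = [], 0  # part 0 is never merged away, so walk the list from there
    while i < n:
        out.append(parts[i])
        i = nxt[i]
    return out


toy_ranks = {(b"a", b"b"): 0, (b"ab", b"c"): 1}  # toy merge table, not a real vocabulary
print(bpe_encode_heap(b"abcab", toy_ranks))
```

Ties are broken by position, so it picks the lowest-ranked, leftmost pair each time, same as a plain rescanning loop would.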


Yes, you're right. There could be multiple ways to tokenise a sentence. Shouldn't all the valid tokens be included in the vector?


If this splitting interests you, https://github.com/openai/tiktoken/blob/main/src/lib.rs is a great look at all the ugly edge cases that cause instabilities.

In theory the byte pair encoding of a string is unique, but in practice it is harder. It's also complicated by regex and UTF-8. Most of the time the differences shouldn't be too important, because the neural network should be able to handle typos.
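A tiny illustration of the UTF-8 part, using only the standard library: a byte-level boundary can land in the middle of a character, and neither half decodes on its own. The `can_decode` helper is just for the demo.

```python
def can_decode(chunk: bytes) -> bool:
    """Return True if the chunk is valid UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = "é".encode("utf-8")    # one character, two bytes
print(len(raw))
print(can_decode(raw))       # the whole character decodes
print(can_decode(raw[:1]))   # lead byte alone does not
print(can_decode(raw[1:]))   # continuation byte alone does not
```

So any code that splits or truncates at the byte level has to decide what to do with partial characters, and two ports that decide differently will disagree.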

In BPE you can have plenty of escaping problems; problematic characters like ' and \ are nasty to get right. The worst case, if you don't handle your errors: if you trained your byte pair encoding dictionary on escaped sentences, a single \ never occurs, because it is always encoded as \\. So if you split the string between the two backslashes, the byte pair encoder might fail to find the key in the dictionary.
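A toy reproduction of that failure mode. Everything here is hypothetical: the three-entry `vocab` pretends to have been built from escaped text only, and a greedy longest-match lookup stands in for the real BPE step, just to show the missing key.

```python
def escape(text: str) -> str:
    """Escape backslashes by doubling them."""
    return text.replace("\\", "\\\\")


# Hypothetical vocabulary learned on escaped text: a backslash only ever appears doubled.
vocab = {"\\\\": 0, "a": 1, "b": 2}


def lookup(piece: str) -> list:
    """Greedy longest-match lookup (a stand-in for BPE, not BPE itself)."""
    ids, i = [], 0
    while i < len(piece):
        for length in (2, 1):
            tok = piece[i:i + length]
            if tok in vocab:
                ids.append(vocab[tok])
                i += len(tok)
                break
        else:
            raise KeyError(f"no token for {piece[i]!r}")
    return ids


escaped = escape("a\\b")       # the four characters: a \ \ b
print(lookup(escaped))         # encodes fine as a whole

left, right = escaped[:2], escaped[2:]  # split between the two backslashes
try:
    lookup(left)
except KeyError as err:
    print("failed on the split piece:", err)
```

The usual ways out are to never split inside an escape sequence, or to make sure every single byte has a fallback entry in the dictionary.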

It is also hard to keep the thing deterministic and stable when you change your regex version (once you have trained a network, you'd like not to have to retrain it because of a bugfix in a regex library). Porting to other platforms also becomes very hard if you want reproducible results.



