Byte pair encoding is, by construction, quadratic in the length of the words.
And usually the input is pre-split into words before being given to the byte pair encoder.
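To make the quadratic behavior concrete, here's a minimal sketch of a naive BPE loop (a toy illustration, not OpenAI's actual implementation): each merge pass rescans the whole word, and a word of length n can need up to n-1 passes, hence O(n²).

```python
def bpe_encode(word, merge_ranks):
    """Naive BPE: word is a string, merge_ranks maps (left, right) pairs
    to a priority (lower = merge first). Toy code for illustration only."""
    symbols = list(word)
    while len(symbols) > 1:
        # one full scan per merge: rank every adjacent pair
        pairs = [(merge_ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no known merge applies anymore
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]  # merge in place
    return symbols

# hypothetical merge table: merge 'l'+'o' first, then 'lo'+'w'
ranks = {("l", "o"): 0, ("lo", "w"): 1}
```

With real implementations the per-word cost is bounded precisely because the pre-splitting step keeps words short, which is why a single pathological 10k-character "word" breaks the assumption.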
Hopefully they use a different implementation in prod.
The input needs to be sanitized against very long words (like 10k-character-long words :) ).
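One possible guard (my assumption, not something OpenAI is known to ship) is to hard-cap word length before the quadratic BPE step ever sees it:

```python
MAX_WORD_LEN = 256  # arbitrary cap, purely for illustration

def split_long_words(words, cap=MAX_WORD_LEN):
    """Chop any pathological word into cap-sized chunks so the
    quadratic BPE step never sees a 10k-character input."""
    for w in words:
        for i in range(0, len(w), cap):
            yield w[i:i + cap]

# a normal word passes through untouched; a 10k-char word gets chunked
chunks = list(split_long_words(["ok", "x" * 10_000], cap=256))
```

The chunk boundaries can change tokenization at the seams, so this is a trade-off between robustness and fidelity.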
In previous tokenizers like CLIP (https://github.com/openai/CLIP/blob/main/clip/simple_tokeniz... ), they used additional preprocessing steps like HTML escaping and various cleanup passes via Python libraries (ftfy, html and regex), which made porting the code exactly to other languages a real pain.
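For a sense of what that cleanup looks like, here's a rough stdlib-only sketch in the spirit of CLIP's preprocessing (the real code also runs ftfy's text fixing, a third-party step omitted here; exactly the kind of dependency that makes faithful ports painful):

```python
import html
import re

def clean_text(text):
    """CLIP-style cleanup sketch: double HTML-unescape, then collapse
    whitespace. The ftfy mojibake-repair pass is deliberately left out."""
    text = html.unescape(html.unescape(text))  # twice, to undo double-escaped input
    text = re.sub(r"\s+", " ", text)           # collapse whitespace runs
    return text.strip()
```

Replicating even these two stdlib calls bit-for-bit in another language already requires matching Python's entity table and `\s` semantics.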
In theory byte pair encoding is unique, but practice makes it harder. It's also complicated by the regex and UTF-8 handling. Most of the time the differences shouldn't be too important, because the neural network should be able to handle typos.
In BPE you may have plenty of escaping problems; problematic characters like ' and \ are nasty to get right. Worst case, if you don't handle your errors: if you trained your byte pair encoding dictionary on escaped sentences, then a single \ should never occur, since it's always encoded as \\. So if you split the string between the two backslashes, the byte pair encoding might fail to find the key in the dictionary.
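A toy illustration of that pitfall (hypothetical vocab, not a real tokenizer's): a vocab built from escaped text only ever contains the two-character sequence \\ as a token, never a lone backslash.

```python
# Keys the (escaped) training data could produce. In Python source,
# "\\\\" is two backslash characters and "\\" is one.
vocab = {"\\\\": 0, "n": 1, "\\\\n": 2}

lone_backslash = "\\"  # a single backslash character, seen at inference time

# The inference-time lookup fails: only the escaped pair exists.
print(lone_backslash in vocab)      # lone \ is missing from the vocab
print(lone_backslash * 2 in vocab)  # the escaped pair \\ is present
```

This is why a mismatch between training-time and inference-time escaping surfaces as mysterious KeyErrors rather than merely worse tokenizations.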
Making the thing deterministic and stable across regex versions is also hard (and once you've trained a network, you'd like to not have to retrain it when there's a bugfix in a regex library). Porting to other platforms also becomes very hard if you want replicable results.
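One way to at least detect that kind of drift (my suggestion, not a known OpenAI practice): keep a golden digest of the pre-tokenizer's output on a fixed corpus, and fail loudly after every dependency bump if it changes. The regex pattern here is a made-up stand-in, not a real tokenizer's.

```python
import hashlib
import re

PATTERN = re.compile(r"\w+|\S")  # hypothetical pre-tokenization regex

def split_digest(corpus):
    """Hash the exact token stream the regex produces on a fixed corpus,
    so any behavior change in the regex engine flips the digest."""
    tokens = []
    for line in corpus:
        tokens.extend(PATTERN.findall(line))
    joined = "\x00".join(tokens).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()

corpus = ["hello world!", "don't  panic"]
digest = split_digest(corpus)
# Store `digest` next to the model weights; re-check it in CI.
```

A port to another language passes the same test only if it reproduces the token stream byte-for-byte, which is exactly the replicability bar mentioned above.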
Sadly this library doesn't solve that :'-(