"They" as in OpenAI, when they trained the tokenizer, just dumped a big set of text data into a BPE (byte pair encoding) tokenizer training script, and it saw that string in the data so many times that it ended up making a token for it.
"They" as in the rest of us afterward... probably just looked at the token list. It's a little over fifty thousand items, mostly short words and fragments of words, and can be fun to explore.
The GPT-2 and GPT-3 models proper were trained on different data from what their tokenizer was trained on, one of the major differences being that some strings (like " SolidGoldMagikarp") showed up very rarely in the data the model itself saw. As a result, the models can respond to the tokens for those strings a bit strangely, which is why they're called "glitch tokens". From what I've seen, the base models tend to just act as if the glitch token weren't there, but instruction-tuned models can act in weirdly deranged ways upon seeing them.
The overall lesson, as I understand it, is just that you should train your tokenizer and model on the same data. But (again as I understand it - we don't know what OpenAI actually did) you can also simply remove the glitch tokens from your tokenizer, and afterward it'll just encode those strings into a few more, shorter tokens. The model won't ever have seen that specific sequence, but it'll at least be familiar with every token in it, and unlike never-before-seen single tokens, it's quite used to dealing with never-before-seen sentences.
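For what it's worth, the newer cl100k_base vocabulary seems to reflect this: as far as I can tell the string no longer has its own token there and just splits into several ordinary pieces. A quick comparison with tiktoken (again assuming it's installed; the exact ids and splits are whatever your version produces):

```python
import tiktoken

old = tiktoken.get_encoding("gpt2")         # GPT-2/GPT-3 era vocabulary
new = tiktoken.get_encoding("cl100k_base")  # later vocabulary

s = " SolidGoldMagikarp"
for enc in (old, new):
    ids = enc.encode(s)
    print(ids, [enc.decode([t]) for t in ids])
# The old encoding should give a single id; the new one should fall back to a
# handful of shorter, well-trained pieces, which is the graceful degradation
# described above.
```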
"They" as in the rest of us afterward... probably just looked at the token list. It's a little over fifty thousand items, mostly short words and fragments of words, and can be fun to explore.
The GPT-2 and GPT-3 models proper were trained on different data than the tokenizer they use, one of the major differences being that some strings (like " SolidGoldMagikarp") showed up very rarely in the data that the model saw. As a result, the models can respond to the tokens for those strings a bit strangely, which is why they're called "glitch tokens". From what I've seen, the base models tend to just act as if the glitch token wasn't there, but instruction-tuned models can act in weirdly deranged ways upon seeing them.
The lesson to learn overall AIUI is just that you should train your tokenizer and model on the same data. But (also AIUI - we don't know what OpenAI actually did) you can also simply just remove the glitch tokens from your tokenizer, and it'll just encode the string into a few more tokens afterward. The model won't ever have seen that specific sequence, but it'll at least be familiar with all the tokens in it, and unlike never-before-seen single tokens, it's quite used to dealing with never-before-seen sentences.