OpenAI Tokenizer (platform.openai.com)
341 points by tosh on April 5, 2023 | 162 comments



Hi folks – I work at OpenAI and helped build this page, awesome to see it on here! Heads up that it's a bit out of date as GPT4 has a different tokenizer than GPT3. I'd recommend checking out tiktoken (https://github.com/openai/tiktoken) or this other excellent app that a community member made (https://tiktokenizer.vercel.app)


I wasn't aware that GPT-3 and GPT-4 use different tokenizers. I've read https://github.com/openai/openai-cookbook/blob/main/examples... and misinterpreted "ChatGPT models like gpt-3.5-turbo and gpt-4 use tokens in the same way as older completions models, ..." as GPT-3 and GPT-4 using the same tokenizer except for im_ tokens. Now I can see so many improvements, including the encoding of whitespaces and digits.


Hey it seems that UTF-8 support is broken on the page.

A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is beautiful and amazing" in Russian).

My assumption is that it's the implementation on the page that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I'd guess wouldn't be the case if the tokenization were as presented on the page.


Author here, you are correct! The issue is that a single user-perceived character might span into multiple tokens. This should be fixed now.


Hey. Thank you! However has the fix not been deployed yet? Still shows broken UTF-8.

> a single user-perceived character might span into multiple tokens

Is this the way it works as designed or is this a bug?


Are there plans to release tokenisers for other platforms? I'm accessing the OpenAI API from Clojure, and it would be really nice to have a JVM version so I can estimate token use before sending.


That is very helpful, thank you. I had not realised the latest models were now tokenizing numbers as 3-digit groups. Can you give any insight into why 3 digits?


Was the purpose of the page and post to generate comments that can be used as training data?


This tool is really useful for helping develop a better intuition for how GPT models actually work.

Paste in some text and switch to the token IDs view. Note how common words (like "the ") have low integer token IDs, while things like emojis are split into several numbers.

An LLM is a function that takes an array of integers and returns a new array of integers. Seeing the tokens like this helped me reinforce that mental model.
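
If you want to poke at the same thing from code, here's a minimal sketch using OpenAI's tiktoken library (assuming the "gpt2" encoding, which is the one the older GPT-3-era models and this page use):

    # Minimal sketch with tiktoken; the "gpt2" encoding is assumed to match the tokenizer this page shows.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    for text in ["the ", "hello world", "🤔"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]   # partial UTF-8 tokens print as replacement characters
        print(f"{text!r:16} -> ids={ids} pieces={pieces}")

    # Round trip: decoding the ids gives back the original string.
    assert enc.decode(enc.encode("hello world")) == "hello world"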


> An LLM is a function that takes an array of integers and returns a new array of integers.

To refine this a bit more, an LLM is a function that takes an array of integers (or really, a batch of arrays of integers) and returns, for each position in the array, a probability distribution over every possible integer, with the positions shifted left by one place so that each output predicts the next token.
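
A hedged sketch of what that looks like concretely, using the small open GPT-2 model from Hugging Face transformers (not OpenAI's hosted models), just to show the shapes:

    # Sketch only: inspect the input/output shapes with the small GPT-2 model.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    ids = tok("The cat sat on the", return_tensors="pt").input_ids   # shape (1, seq_len)
    with torch.no_grad():
        logits = model(ids).logits                                   # shape (1, seq_len, vocab_size)

    # One distribution per position; position i is the prediction for token i+1 ("shifted left").
    probs = torch.softmax(logits[0, -1], dim=-1)
    print(ids.shape, logits.shape)
    print("most likely next token:", tok.decode(probs.argmax().item()))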


To refine further: takes an array of integers and draws the rest of the f**king owl.


I would argue that my hard drive too can take an array of integers and produce an owl.


Oh let's see! I'll start: 1


1


Moreover:

The integers are really indices into the embedding space.

So you'd want to think of it more as the model maintaining a giant matrix (one row = one token; one column = one embedding feature).

The array of indices gets the relevant embeddings to shove through the rest of the model's forward pass.
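
A tiny sketch of that lookup (sizes and ids here are made up for illustration):

    import numpy as np

    vocab_size, d_model = 50257, 768                 # GPT-2-ish sizes
    embedding = np.random.randn(vocab_size, d_model).astype(np.float32)

    token_ids = np.array([464, 3290, 3332])          # hypothetical ids for three tokens
    x = embedding[token_ids]                         # shape (3, 768): one embedding row per token

    print(x.shape)                                   # this is what flows into the transformer blocks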


I've always wondered how stop tokens fit in here. Does the LLM generate a probability for "stop" in addition to every other token in the space? Or is stopping handled heuristically by the outer loop that generates the output tokens sequentially?

The API docs talk about letting you specify your own stop token (like "<!-->") but I don't think "token" is meant in the same sense here.


Yes, the model has something like an EOF token which it emits for the output to end. It is part of the probability distribution that the model predicts.


Could the properties of the distribution (the spread? not stats literate enough) be used to calculate a confidence metric for the answer?


Yes! This is something that is done. The problem is that a) it’s tough to find a sane denominator as the likelihood of the entire sequence can be quite small, even though it’s the best answer and b) the answer isn’t grounded in anything, so the confidence score isn’t super helpful.

A score like this can be useful for active learning though, where you find areas of low confidence in your dataset and get more data to train on.
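
One common way to turn that into a score (a sketch, assuming you already have per-token log-probs, e.g. from the completions API's logprobs option) is to length-normalize, which helps with the "sane denominator" problem in (a):

    import math

    def sequence_confidence(token_logprobs):
        """Geometric-mean probability per token, so longer answers aren't penalized."""
        avg_logprob = sum(token_logprobs) / len(token_logprobs)
        return math.exp(avg_logprob)

    print(sequence_confidence([-0.1, -0.3, -2.0, -0.05]))   # ~0.54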


A probability distribution for each token in the array, not just the last one?

I don't understand that, because wouldn't the probabilities later in the sentence be impacted by the tokens chosen earlier in the sentence?


Yes, one distribution per position. This was a key innovation that allowed training over the entire sequence in parallel, rather than on one token prediction at a time, thereby massively speeding up training.

More recently, there are models like RWKV that can run in both parallel (GPT-like) mode for training and serial (RNN-like) mode for inference.

But transformers always output a probability distribution at each position in the context.


You unfold it one token at a time by sampling from the returned distribution. To control the amount of variation, you can make the probability distribution more extreme; in the most extreme case you only ever select the most likely token, and the sequence becomes deterministic.

Yes, what happens later in the sentence depends on the particular choice you made earlier in the sentence.
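
A sketch of that "make the distribution more extreme" knob (this is what the temperature parameter does: divide the logits by T before the softmax, with plain argmax as the T -> 0 limit):

    import numpy as np

    def sample(logits, temperature=1.0, rng=np.random.default_rng()):
        if temperature == 0.0:
            return int(np.argmax(logits))            # deterministic: always the top token
        z = logits / temperature                     # T < 1 sharpens, T > 1 flattens
        z = z - z.max()                              # numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        return int(rng.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.0, 0.1])
    print([sample(logits, t) for t in (0.0, 0.7, 1.5)])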


"Takes an array and returns a single integer" is more correct, and usefully so?

What I still can't wrap my head around is that tokens often don't align with word structures.

"Antidepressants" I'd imagine tokenizes as "anti" "depress" "ant". But nope. And "antipsychotic" tokenizes differently from it too!

I assumed the output is a token i.e. a single integer and that's rarely even a full word?


> "Antidepressants" I'd imagine tokenizes as "anti" "depress" "ant". But nope. And "antipsychotic" tokenizes differently from it too..

Tokens are symbols. You're thinking of them like embedding vectors. Tokens represent the step before a meaning is assigned to the text: it turns some unit of text into what's essentially an identifier.

Which is to say, two homonyms would have the same token id, even though they have different meanings. Tokens have no notion of context.


what is the benefit to such splitting of text based on seemingly meaningless lines? isn't there a better way to do it?


You could split on words instead of tokens, but then you need a large vocabulary, you can't deal with inputs that contain a word which is not in the vocabulary, and it's not so clear what a "word" even is.

Instead of coming up with more and more heuristics to chop a sequence of bytes up into "words" in a vocabulary, we could simply set a limit on the size of the vocabulary (number of tokens), put all the bytes in there (so we can at least handle any input byte by byte), and pack the remaining space with the most common multi-byte sequences. Then you end up with tokens like the ones here.


There is no 'meaning' inside these AIs. It's terribly confusing to think about these LLMs as having 'meaning' in the same way we humans do. It's all just statistics. Given a sequence of numbers (each representing some abstract token), what is most likely to come next? That's how 'simple' it is. It's also what makes it so amazing that these things work as well as they do. I giggle like a schoolgirl every time I get it to add some functionality to a function, or write an entire new function, and that's several times a day for what is now months on end. But the key to using them is seeing that there is no 'meaning' in them. It's all just streams of (to the machine) meaningless tokens.


There’s no meaning to the tokens, but research has shown that the models themselves capture meaning. Technically they are producing the next word but in order to do that for a dataset of a trillion words they actually have to develop internal models of how the world works. There was a post on HN a couple days ago that talked about the research done to show this.


You say that but we have models of meaning in humans too.

You can put people in an fMRI and ask them to think "car".

You can ask someone to think of objects and detect when they think "car".

What happened there is pairing a bunch of tensors to meanings and matching them.

We can do something similar with embeddings.

To be clear I don't intend to give the impression that these LLMs are doing something miraculous. Just that we are increasingly peeling back the veil of how brains think.


> You can put people in an fMRI and ask them to think "car".

I don't know about other people, but when I think “car” really hard, I can feel the muscles in my throat adjust slightly to match the sound of the word “car”. Perhaps that sort of thing is what the MRI machine is picking up, rather than it being able to pick up some kind of "internal representation" of car.


In fact it also picks up the parts of your brain to do with driving (if you're a driver). Maybe also the part to do with the smell of fuel in me, but not you.

It'll also light up in the parts of my brain to do with reading, writing, hearing the word in the languages I speak.

What does car mean to me if it doesn't connect to all the concepts that relate to cars?


Maybe they should do the same study on people that lack an internal monologue to see if they have the same results.


If it just decides on a single token at a time, can it backtrack and choose differently under that operation, given the next tokens? What I wonder is, how can it plan ahead and output meaningful (to us) responses, like working code or useful articles? How can it "reason" logically when it needs to solve a problem, a riddle etc, by only selecting a token at a time? Wouldn't that dumbed down approach prove myopic for complex compositions? Doesn't it need some over-ruling goal-based heuristic system?


There’s no planning, no reason. It’s all ‘what word is next…’

I found Stephen Wolfram's explanation helpful. He has a YouTube video version which I enjoyed too. This blog post was on HN last month, but I never get good search results on HN.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...


If we get a bit quantum (or invoke an act of God, for some), then backtracking could happen by collapsing the dead-ends and "changing" history to stay with what turns out to be the solid plan. Could emergent consciousness in the AI's neurons do the planning and reasoning that it rather seems to be doing, even though ML experts will say it is not? If our consciousness could by any chance reside not in the electrical currents of the wetware, could the AI's reason also not reside in the tokens? Is it possible that some mysterious process is taking place?


It seems that the output feels reasoned because of the reasoning implicit in language patterns. Which is enough to write code and essays, apparently.


N.B. to self: study more Foucault to grasp this.


It is wild that a process like that can generate working code. Humans speak their words in order, but they don't write their code in order. Why would writing code in order work?


With GPT-4 this process also allows it to understand what is inside a graphical image and talk intelligently and coherently about it.

Next token prediction produces the most head exploding emergent effects.


Bard at least produces multiple drafts. I believe that is preferred over backtracking.

Generation is ultimately deterministic (seeded prng) so backtracking wouldn't make sense.


ChatGPT models syntax, not semantics

There's no "better way" to do it because the tokens are all meaningless to ChatGPT, it only cares about how efficiently they can be parsed and processed.

The competing desires are to model all language with the biggest tokens possible, and the fewest tokens possible. The lines aren't meaningless, text is split into the largest possible chunks using a set of the most common tokens.

Common words, like "the", "fast", "unity", "flying" are all tokens, but it's not because they're words, it's because they're common letter clusters, undistinguished from "fl", "ing", "un", "ple"

"gadflying" is tokenized into [g, ad, flying], even though it's only loosely semantically related to "flying", it's just the most efficient way to tokenize it.


Breaking text into sub-word units...

1. Greatly reduces memory usage. Instead of memorizing every inflection of the word "walk", it memorizes the root (walk) and the modifiers (ing, ed, er, ...). These modifiers can be reused for other words.

2. Allows for word compositions that weren't in the training set. This is great for uncommon or new expressions like "googlification" or "unalive".


The walk example doesn't quite hold up.

If you put:

    test walk walker walking walked
into the tokenizer you will see the following tokens:

    [test][ walk][ walk][er][ walking][ walked]
Only walker is broken up into two different tokens.

I added "test" to that because walk at the start doesn't include the leading space and [walk] and [ walk] are different tokens.

For even more fun, [walker] is a distinct token if it doesn't include the leading space.

    test walker floorwalker foowalker
becomes:

    [test][ walk][er][ floor][walker][ fo][ow][alker]
How we think of words doesn't cleanly map to tokens.

(Late edit)

    walker floorwalker
becomes tokenized as:

    [walker][ floor][walker]
So in that case, they're the same token. It's curious how whitespace influences the word-to-token mapping.


There’s no syntax or structure to the token set. The actual tokens were algorithmically selected based on the training data to (putting things loosely) optimize compression of the training data given a token set size.


Sure, but what I'm hearing in the parent post is a question about why we don't use linguistically motivated subword units (of similar length/vocabulary size and thus memory usage), e.g. cutting across morpheme boundaries instead of whatever an algorithm like BPE calculates.


Gratitudes for the explanatories, trousering this for rethinkalysing tomorn.


Imagine a circular list (in however you want to construct that) that matches the input size for the model.

The prompt is initially loaded at the start of the list and the model is run and produces high activation on a single output. That token output is then fed to the end of the input circular list and also appended to the "this is what the model returned" output.

This process of running the model, getting the token output and sending one copy to the input list and one copy to the return string is repeated until the number of tokens generated hits a numeric limit or a token that represents the stop token is encountered.
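
A rough sketch of that outer loop (model_step is a placeholder for one forward pass plus sampling; 50256 is GPT-2's <|endoftext|> id, used here just as an example stop token):

    def generate(prompt_ids, model_step, eos_id=50256, max_new_tokens=256, max_context=2048):
        context = list(prompt_ids)
        output = []
        for _ in range(max_new_tokens):
            next_id = model_step(context[-max_context:])   # only what fits in the model's input window
            if next_id == eos_id:                          # the stop token is just another token id
                break
            context.append(next_id)
            output.append(next_id)
        return output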


How does it decide whether to split "antid" into "ant" "id" or "anti" "d"?


Try this:

'this is a day that this sentence with clarify that day. Is this not a good day?'

[5661, 318, 257, 1110, 326, 428, 6827, 351, 18282, 326, 1110, 13, 1148, 428, 407, 257, 922, 1110, 30]

Note 'day' is solidly 1110 here. Now start a sentence with day.

"day began with laughter"

[12393, 2540, 351, 20263, 13]

So the logical word -> token(s, p) -> id(s) function definitely has 1 position parameter as well.

"Day after day after this day"

[12393, 706, 1110, 706, 428, 1110]

"Day day Home home home"

[12393, 1110, 5995, 1363, 1363]

"day day day home home home"

[820, 1110, 1110, 1363, 1363, 1363]

[corrected/edited: so case-sensitive and position sensitive as well.]

btw doesn't the output array contain the prompt as well (because of the transformer architecture? not entirely sure ~iirc)


> So the logical word -> token(s, p) -> id(s) function definitely has 1 position parameter as well.

You're missing that it groups in spaces. The position isn't relevant, but "day" is a different token than " day".


Ah, you're right. [try "datedate". interesting how it partitions that as 'dated' + 'ate'. Compare with "matemate" -> 'mat', 'emate'.]

p.s. "ifthiswasagermanword if this was a german word."

It's not even spaces. That second sequence ' german' is chopped up as 'ag' 'erman'.


There's no position parameter, it's just that " day" and "day" are different tokens.


It's a lot simpler than that. You can see in the tokenizer that the boundary for words includes the preceding space. So, since the first word doesn't have a preceding space, it has a different token.
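
You can check this with tiktoken (using the GPT-2/GPT-3 encoding; the exact ids don't matter, just that they differ):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    print(enc.encode("day"))     # no leading space: one id
    print(enc.encode(" day"))    # leading space folded into the token: a different id
    print(enc.encode("Day"))     # capitalization gives yet another id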


I found this tool recently when it was linked from this Computerphile video[1] about "glitch tokens".

tldw:

Certain junk data was thrown out after the tokenizer was trained, e.g. the /r/counting[2] community data and debug logs from Rocket League.

Some tokens specific to those contexts stuck around, however, and are now like "a color you've never seen before" as far as GPT-X models are concerned.

Giving the model one of these "glitch" tokens causes it to kind of freak out and return gibberish or some completely random response, because it never encountered them during training (they were removed when the data was cleaned).

[1] https://www.youtube.com/watch?v=WO2X3oZEJOA [2] https://reddit.com/r/counting


Another interesting tweet[1] I saw today shows how you can ask ChatGPT to compress text and it invents its own (effective!) shorthand.

I bet it's related somehow to glitch tokens and the way GPT is grouping tokens internally.

[1] https://mobile.twitter.com/VictorTaelin/status/1642664054912...


I experimented with something similar previously and it doesn’t really work. It usually can’t decompress it properly


The example given is obviously GPT's own generation. It might not be expected to round-trip arbitrary text.


Had not seen the Computerphile video yet. Laughing my socks off right now. Thanks for that.


Accidentally quadratic!

Byte pair encoding is by construction quadratic in the length of the words. And usually the input is pre-split into words before being given to the byte pair encoder.

Hopefully they use a different implementation in prod. It needs to be sanitized against very long words (like 10k-character-long words :) ).

In previous tokenizer like CLIP (https://github.com/openai/CLIP/blob/main/clip/simple_tokeniz... ) , they used additional preprocessing steps like html escaping and various cleanup preprocessing using some python library (ftfy, html and regex), which made porting the code exactly to other languages a real pain.

Sadly this library doesn't solve that :'-(


For GPT they didn't pre-split into words so they definitely have something faster than quadratic! Not sure what it is though, I'm very curious.


Yes, you're right. There could be multiple ways to tokenise a sentence. Shouldn't all the valid tokens be included in the vector?


If this splitting interests you, having a look at https://github.com/openai/tiktoken/blob/main/src/lib.rs is great for showing all the ugly edge cases that cause instabilities.

In theory Byte Pair Encoding is unique, but practice makes it harder. It's also complicated due to regex and UTF-8. Most of the time the differences shouldn't be too important because the neural network should be able to handle typos.

In BPE you may have plenty of escaping problems; problematic characters like ' and \ are nasty to get right. Worst case, if you don't handle your errors: if you have trained your byte pair encoding dictionary on escaped sentences, then a single \ should never occur since it is encoded as \\, so if you split the string between the two \ characters, the byte pair encoding might fail to find the key in the dictionary.

Making the thing deterministic and stable when you change your regex version is hard (and when you train one network you'd like to not have to retrain it when there is a bugfix in a regex library). Porting to other platforms also becomes very hard if you want replicable results.


We noticed that this webpage is out of date for recent models so we (diagram.com) commissioned a better one that lets you pick any of OpenAI's models including chats:

https://tiktokenizer.vercel.app/


Wow, I can't thank you enough for this. Somehow I never noticed that GPT-4 doesn't use separate tokens for each tab like 3.5. I was wasting a lot of time minimizing excessively tabbed code to save on the token count! Like seriously way too much time, all based on a bad assumption.

https://twitter.com/jonathanfly/status/1643633463260577794


Interestingly they seem to have different token ids for "Word", "word", " Word" and " word". That seems like kind of a wasteful design.

It seems like it would make more sense to have a single token for all variants and then a "capitalized where not expected" token (e.g. "foo Foo"), a "not capitalized where expected" token (e.g. "foo. foo") and a "missing space where expected" token (e.g. "foo.Foo").

The lack of any normalization also means that WrItInG tExT lIkE tHiS will make future GPT versions not be able to make full use of the text during future training unless they change the tokenization (or the model is so overpowered that it doesn't matter).


The tokenization is a statistical product of the frequency of byte sequences in the training corpus. It might seem unintuitive but I wouldn't go so far as to say it's "wasteful". It may very well be but frankly you'd have to have a good explanation for why byte pair encoding is so much more successful than other tokenization schemes.


> why byte pair encoding is so much more successful than other tokenization schemes.

what's the evidence for that please? just asking because i dont know, not because i disagree. ive read a bunch of BPE explainers but nobody has bothered to explain why or how we landed on BPE


I'm not an AI expert, so I don't know what research has been done to verify it, but this comment below, https://news.ycombinator.com/item?id=35454839 , helped me understand it, and intuitively I think it makes sense.

That is, byte pair encoding tokenization is itself based on how common it is to see particular characters in sequential order in the training data. Thus, if the training data really frequently sees characters together (as, of course, it does in common words), then these words get a single token. Which, given how an LLM works, really makes sense because it looks for statistical relationships among strings of tokens. Thus, the way I think of it is that byte pair encoding is essentially like a pre-processing step that already optimizes for statistical relationships among individual characters.


In practice, GPT uses byte-pair encoding [0] for each Unicode character.

That’s why cases are treated differently - they’re different in Unicode.

This is also the only way to teach a model how to properly capitalize things (since there are no human defined rules).

[0] https://towardsdatascience.com/byte-pair-encoding-subword-ba....


The actual tokenizer often does not matter since you can add pre processors/normalizers. I assume they did it like this because capitalization matters in a lot of contexts


Similarly, pre-processing can be harmful. I think there are reasonable predictive differences when predicting the next-word follow up to a sentence that's properly capitalized versus one that's all lowercase. Not only will the "all lowercase" convention likely prevail in forward predictions, it also indicates something about the context of the writing, the author, their sense of style.

It's hard to argue that this information isn't (a) being captured by GPTs and (b) important. If you just threw it away, GPTs would have less information available to absorb.


> Similarly, pre-processing can be harmful.

A good example is the initially released BERT-multilingual-uncased model back from the first BERT paper, which (without even mentioning it anywhere) not only collapsed the case but also removed diacritic marks from latin characters, thus killing its performance on those languages which heavily rely on them.


The model is indeed so overpowered that it doesn’t matter in practice. See the Sentencepiece paper for some discussion of the design decisions on stuff like whitespace.


Not all languages use capitalization the same way (or have it at all) and not all LLM input/output is natural language.


I don't think it's wasteful, if I ask GPT to process/generate a non-human language like a linux shell, capitalization is crucial...


It’s not surprising or bad design at all. Words mean different things depending on context, punctuation, etc.


I wonder if this is why writing things in all caps in ChatGPT sometimes has some effect on the response.


I am glad it tokenizes Python and all other programming languages in a systematic way.


They charge by the token so I’m not so sure about that


> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly 3/4 of a word (so 100 tokens ~= 75 words).

Just for fun I tried entering in "pneumonoultramicroscopicsilicovolcanoconiosis" and "antidisestablishmentarianism". The first was pretty evenly split into tokens of length 1-5 characters, but the second put all of "establishment" into a single token.

No useful conclusions drawn, but it was an interesting test.


I desperately want to be able to get a concrete token count for my prompt before making a call - things like this make it very hard to request the right amount of max_tokens for longer prompt/generation pairs.



OpenAI seems to use Tiktoken [0]. It also covers GPT-4 token encoding.

[0] https://github.com/openai/tiktoken
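
For the "count tokens before the call" use case above, a minimal sketch with tiktoken's model-aware helper:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")   # resolves to the cl100k_base encoding
    prompt = "How many tokens is this prompt?"
    print(len(enc.encode(prompt)))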


Seems odd they don't reference this on the page. Instead they list:

"If you need a programmatic interface for tokenizing text, check out the transformers package for python or the gpt-3-encoder package for node.js."

with the links:

https://huggingface.co/docs/transformers/model_doc/gpt2#tran...

https://www.npmjs.com/package/gpt-3-encoder


How are these encodings created?

My guess is that it's related to text compression, but I would be happy to see the algorithm responsible for generating them.


https://en.wikipedia.org/wiki/Byte_pair_encoding

tldr: start with individual characters and greedily merge the pairs that are most frequent

A consequence is that an encoding is suited to the dataset it was trained on, so if a language is under-represented in the data, it will take a higher number of tokens to encode it
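
A toy sketch of that greedy merge procedure (illustration only, not OpenAI's training code; real tokenizers work on bytes and use pre-splitting regexes):

    from collections import Counter

    def train_bpe(text, num_merges):
        seq = list(text)                                  # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]      # most frequent adjacent pair
            merges.append(a + b)
            new_seq, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    new_seq.append(a + b)                 # merge the pair into one token
                    i += 2
                else:
                    new_seq.append(seq[i])
                    i += 1
            seq = new_seq
        return merges, seq

    merges, seq = train_bpe("low lower lowest low low", num_merges=6)
    print(merges)   # frequent pairs become new tokens, e.g. 'lo' then 'low'
    print(seq)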


Reading the sentencepiece paper they say:

> The main difference to other compression algorithms, such as Huffman encoding, which have been proposed to produce a variable-length encoding of words for NMT (Chitnis and DeNero, 2015), is that our symbol sequences are still interpretable as subword units, and that the network can generalize to translate and produce new words (unseen at training time) on the basis of these subword units.

I don't see why Huffman encoding doesn't give you that same interpretability?

Actually the algorithm for producing a Huffman tree is very similar to that for BPE:

> The process begins with the leaf nodes containing the probabilities of the symbol they represent. Then, the process takes the two nodes with smallest probability, and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weight of the children. We then apply the process again, on the new internal node and on the remaining nodes (i.e., we exclude the two leaf nodes), we repeat this process until only one node remains, which is the root

(from https://en.m.wikipedia.org/wiki/Huffman_coding)

I guess the issue is that Huffman requires the alphabet to be predefined, where BPE "discovers it" as it goes along.


> I don't see why Huffman encoding doesn't give you that same interpretability?

It might just be that a Huffman encoding is a bit-string and not a byte-string.

BPE encoding causes interesting failures, like how it can't do anagrams or spell words backwards properly. And yet it can make rhyming poems now.


> BPE encoding causes interesting failures, like how it can't do anagrams or spell words backwards properly. And yet it can make rhyming poems now.

I don't think BPE encoding makes anagrams impossible. Just harder.


" SolidGoldMagikarp"

Characters: 18

Tokens: 1

heh. all i know is this is a fun magic token but 1) i dont really know how they found this and 2) i dont know what its implications are. i heard that you can use it to detect if you are talking to an AI.


I think it's related to Reddit users who posted (very frequently!) on a counting focused subreddit (people literally post "1", "2" , "3" in sequence so usernames appear 50k+ times). Some screenshots and links in this Twitter thread: https://twitter.com/SoC_trilogy/status/1623118034960322560

Plus additional commentary here: https://twitter.com/nickmvincent/status/1623409493584519168 (in short: I think this situation is comparable to a "Trap Street" https://en.wikipedia.org/wiki/Trap_street that reveals when a map seller copies another cartographer)

I hadn't seen the Twitch plays pokemon hypothesis though (from another comment here), I wonder if it could be both!


"They" as in OpenAI, when they trained the tokenizer, just dumped a big set of text data into a BPE (byte pair encoding) tokenizer training script, and it saw that string in the data so many times that it ended up making a token for it.

"They" as in the rest of us afterward... probably just looked at the token list. It's a little over fifty thousand items, mostly short words and fragments of words, and can be fun to explore.

The GPT-2 and GPT-3 models proper were trained on different data than the tokenizer they use, one of the major differences being that some strings (like " SolidGoldMagikarp") showed up very rarely in the data that the model saw. As a result, the models can respond to the tokens for those strings a bit strangely, which is why they're called "glitch tokens". From what I've seen, the base models tend to just act as if the glitch token wasn't there, but instruction-tuned models can act in weirdly deranged ways upon seeing them.

The lesson to learn overall AIUI is just that you should train your tokenizer and model on the same data. But (also AIUI - we don't know what OpenAI actually did) you can also simply just remove the glitch tokens from your tokenizer, and it'll just encode the string into a few more tokens afterward. The model won't ever have seen that specific sequence, but it'll at least be familiar with all the tokens in it, and unlike never-before-seen single tokens, it's quite used to dealing with never-before-seen sentences.


Some of the magic tokens are related to Twitch Plays Pokemon. https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...


hmm. so this is evidence that openai scraped twitch chat of all places? (notoriously ephemeral)

also opens a question as to how tokenizers are trained. should you discard or break up super niche words like this?


It doesn't necessarily mean it scraped twitch chat. That is the name of a moderator. They also moderate the subreddit and probably some other places. And being a moderator for such a popular event they probably had their name mentioned in other places as well. Every time they comment on Reddit their username would also appear.

https://www.reddit.com/r/twitchplayspokemon/comments/2cxkpp/...



Tiktoken is pretty nice. I've been exposing it as an internal service in our infrastructure so that we can get token counts easily. The bigger problem is figuring out how to chunk longer contexts so that you stay within the context window limit defined by the model you are using.


It completely butchers Greek. No wonder it charges so much for so little output. Every Greek character is a token.

I wonder if there is space for innovation there. I would imagine that it's similarly difficult for other non-English languages as well. I fear for the effect this will have on them.


It's crazy that OpenAI hasn't fixed their tokenizer yet. They are leaving the door wide open for some Chinese big tech company to capture the non-Latin script parts of the world.

i18n (and accessibility) was something American tech companies were serious about in the 90s and early 2000s. That is how they captured most of the global market. US tech dropping the ball on this leaves the door wide open for Chinese competitors.


Do OpenAI’s tokenizer issues cash out into worse results for Greek, rather than just being more expensive for gpt-4? (gpt-3.5-turbo already costs peanuts)

If not, then this response seems overblown. The competitive advantage in LLM at this point probably is not tokenizer optimizations and more about having results worth a damn.


The usability is worse. Token limits are so much easier to reach. It's like using a model with dementia.


I didn't think about that but now it's obvious. Good point.


It could be that they are actively working on this problem, but the product has not yet been released.


There is a market opportunity here for a GPT-esque thinking machine that actually masters and knows its Greek ancients well. I knew it was a lack of refined Platonic understanding when ChatGPT said it could not comment further on the Russian war.


It seems to be an accidental advantage of the messy hodgepodge that is English. There are semantic clues everywhere in word order patterns.


Interesting that Japanese seems to get every character/letter tokenized individually.


Huh, unless the demo is broken it seems to tokenize unicode per-byte, rather than per-character.

E.g.:

    "あ" => [40948]
    "亜" => [12859, 250]
    "ア" => [171, 121, 109]


  "0123456789" => [
      171, 120, 238, 171, 120, 239, 
      171, 120, 240, 171, 120, 241, 
      171, 120, 242, 171, 120, 243, 
      171, 120, 244, 171, 120, 245, 
      171, 120, 246, 171, 120, 247
  ]

  "~" => [171, 121, 252, 198]
Whoa. Literally just giving it "Potato" gives a token count three times the letter count: 18 tokens for 6 letters.


Yeah I noticed similar. That can't be right...


It's probably operating on UTF-8 data on a byte-per-byte level without any additional processing. Just feeding it the raw string data and letting it assign the tokens.

It's similar to how it is splitting words at arbitrary points, rather than at clear morphological or lexical locations (e.g. on the Jane Austen text `"Now, ma'am," said Jane to her aunt, "shall we join Mrs. Elton?"` I've seen it tokenize that as `"|Now|,| ma|'|am|,"| said| Jane| to| her| aunt|,| "|shall| we| join| Mrs|.| El|ton|?"`).
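
A quick way to see the byte-level behaviour (sketch, using tiktoken's "gpt2" encoding; counts will differ for cl100k_base):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for ch in ["a", "あ", "ア", "🥔"]:
        print(ch, "UTF-8 bytes:", len(ch.encode("utf-8")), "tokens:", len(enc.encode(ch)))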


I would find that hard to believe, as the bytes have zero semantic meaning, and moreover, pairing the wrong bytes in the output will result in complete gibberish. It would be akin to tokenizing each English letter "N|o|w|,| |m|a|'|a|m|..." except far worse.

Moreover it's trivially easy to tokenize the glyphs.


A character is the base unit of written communication. Single characters as tokens are not a bad idea, it just requires too many resources to make the model learn and infer.

BPE is a tradeoff between single letters (computationally hard) and a word dictionary (can't handle novel words, languages or complex structures like code syntax). Note that tokens must be hardcoded because the neural network has an output layer consisting of neurons one-to-one mapped to the tokens (and the predicted word is the most activated neuron).

Human brains roughly do the same thing - that's why we have syllables as a tradeoff between letters and words.


> A character is the base unit of written communication

Yes, I guess the point here is that the glyph, not the byte, is the base unit of communication in Unicode charsets.


For which alphabet, or for all alphabets? For kanji that would make sense, as each character is (sort of) a word. Hiragana and katakana are phonetic, with each character usually representing a consonant-vowel pair, so even then there is more information content than in a single English letter.


Japanese to English translator here. The general rule of thumb (that is often used for billing estimates) is that N Japanese characters = N/2 English words.

So if you have a Japanese source text that is 2,000 characters, the English translation will be around 1,000 words.

I tested a translation (one sentence) from a previous job:

Japanese: 94 characters, 128 tokens

English: 39 words (232 characters), 47 tokens

Seems quite unbalanced given that the amount of "information" in the two is equivalent.


Oof... that's a rough job to have in the world of ChatGPT...


It looks like this thing doesn't tokenize into anything with any semantic meaning, but rather just a sequence of bytes that match some sort of information theoretic criteria. It doesn't appear to have any linguistic (written, nor verbal) pattern. I guess it's fine for their specific use case, but whatever.

Tokenization is such a basic and domain-specific operation, it feels like someone had to demo something.

Bonus (code) points for just saying "fuck it" on emojis. They didn't even split it into code points.


Completely useless, but I was curious about the token indexes. I tried to look for Token #0. After a couple minutes of trial and error, it turns out it's the exclamation mark.


how does one "look" for a token? there isn't a lookup table somewhere?


Interesting... I was gonna say "you can ask GPT" but it doesn't work anymore.

On March 23rd, it responded with this: Human: convert these GPT tokens to text: [134, 1322] AI: The tokens [134, 1322] correspond to the words "can" and "not" in the GPT language model. So the text corresponding to these tokens would be "can not"

Today, it's giving me the "As a language model" response


One interesting fact I stumbled upon recently is that the GPT2Tokenizer library and the Tiktoken library produce the same number of tokens for the `text-davinci-003` model, despite GPT2Tokenizer being GPT-2 and text-davinci-003 being GPT-3.5.

For code, however, Tiktoken library and GPT2Tokenizer produce different tokenizations.


Key difference here is in tokenization encoders. The newer models make use of the `cl100k_base` encoding.


I've put this into Codex

  fn hello(message: String) -> Result<String> {
      this is not part of code
  }
Codex does a pretty good job of tokenizing the different parts of the first line (fn, the open parenthesis; even -> is considered its own token) but fails miserably on the second line. The second line should be a single token of invalid code. It would be tokenized as text if that line were preceded by "//" or a comment-start indicator.

Interestingly, GPT-3/4 can probably explain the concept of commenting and specifically for Rust too. However, it can't apply it in this particular context.


Really interesting. How do these work? are these a separate ai/neural net/model to the transformer? they don't seem to follow any humanlike structure or process?


I took a random Java file I had laying around that I was working on lately.

~100 lines of code + whitespace

1300-1900 tokens

So if I fed this to OpenAI and said "how can I make this file better/improve upon it", it would have cost:

between $0.03 and $0.12 for this one file using GPT-4

not sure I could use gpt-3.5-turbo since it says it is for chat and not code?

Does that sound right? $0.05 for every file of source code scanned sounds too high for realistic usage. Even $0.01 sounds high? Modern company might have 1,000,000+ files of code, no?
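
Rough back-of-the-envelope sketch of that math (the per-1K prices are the published ones at the time of writing and will change; the Java source here is a stand-in):

    import tiktoken

    PRICE_PER_1K = {"gpt-4-prompt": 0.03, "gpt-4-completion": 0.06, "gpt-3.5-turbo": 0.002}

    enc = tiktoken.encoding_for_model("gpt-4")
    source = "\n".join(100 * ["    int x = computeSomething();"])   # stand-in for a ~100-line Java file
    question = "How can I make this file better/improve upon it?"
    prompt_tokens = len(enc.encode(source)) + len(enc.encode(question))
    cost = prompt_tokens / 1000 * PRICE_PER_1K["gpt-4-prompt"]
    print(prompt_tokens, "prompt tokens ->", f"${cost:.4f} before the completion")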


That seems remarkably cheap compared to engineering hours.


To do....?


GPT-4 costs 30 times more than gpt-3.5-turbo, and 60 times more if you use the 32k-token gpt-4 model. It's by far their most expensive service! I'm using gpt-3.5-turbo, also for coding, and honestly it does just fine.


3.5-turbo can definitely understand code, just like ChatGPT can. GPT4 is better at complex tasks, but 3.5-turbo is always worth evaluating.


excuse my ignorance, but I thought it was $20 per month.


That is for the interactive chat experience. API calls are sold a la carte. Details here https://openai.com/pricing.


thanks!


It would be cool if they told their $20/mo users "here's how much your past 30-day usage would have cost if we billed you via the API" (aka how many tokens/sessions/chats/whatever you used).


What's the benefit of OpenAI charging per-token instead of per-character or per-word?

Since token algorithms change model-to-model and version-to-version, it seems like they've added a lot of complication for no actual benefit to the user except for a little peek under the hood.

Is there a benefit to this scheme that I'm not seeing? Is there some way to game the system otherwise?


It's not that they're just charging per token -- the actual models are operating on a token level. The model sees things in terms of tokens, and in openai's case, these tokens are subword (pieces of words), not words themselves, not characters.

So the real question is, what is the benefit of modeling your tokens as subwords, rather than as characters or words?

I think there is a lot of nuance here, and I don't understand it all. But, some benefits:

* Words, at least in English, are composed of different pieces, like roots, prefixes, and suffixes. Modeling at the subword level more naturally aligns your model with this aspect of language. If I tokenize "warmest", I get "warm" and "est" (see the sketch at the end of this comment). So, the meaning of the token "est" can be learned by the model -- whereas if you modeled by words, the model would have to individually relearn this aspect of information for every word ending in "est".

* Modeling at the subword level makes your sequences a lot shorter than modeling at the character level, which should help with things like efficiency.

* Modeling at the subword level makes your vocabulary a lot bigger than just modeling at the character level, which I suspect helps the model, as it can assign the subwords themselves meaning. E.g., it can learn the meaning of the token "warm" on its own, rather than having to learn this meaning only through learning the relationship of the tokens "w" "a" "r" and "m".

Hope this helps! Would love for anyone else to chime in/add on/correct me.
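
To make the first point concrete (illustration with the GPT-2/GPT-3 encoding; exact splits vary by encoding):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for word in ["warmest", "coldest", "darkest", "googlification"]:
        ids = enc.encode(word)
        print(word, "->", [enc.decode([i]) for i in ids])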


I've noticed that it correctly splits warm|est, cold|est, bleak|est, but darkest is a single token.

I've also seen it group `?"`, `."`, `!"`, and `.--` into single tokens.

It also splits some words like "Elton" as El|ton. Presumably in that case it has mis-identified a -ton suffix.


The tokenizer doesn’t actually change model to model; by the looks of it this is still the GPT-2 tokenizer. Also the per-token cost makes sense because predicting a token is a single forward pass through the model, while for other cost measures they would need to do some science to make it work out on average.


It's not a "benefit", it's simply how the technology works - the underlying model just fundamentally works on tokens as its atomic inputs.

The models don't know anything about words, just tokens.


The models know how to decode base64, so if the pricing were naively per-word, you could pass them one base64 "word" representing a prompt thousands of lines long.

There are still ways to compress prompts though.


Because tokens are the unit of work in an LLM and it’s not correct to say that tokens or even embeddings change between models.


I found it interesting how it tokenizes non-English words:

Steve Jobs was fired from Apple -> [19206, 19161, 373, 6294, 422, 4196] (one token per whole word)

Olha que coisa mais linda e cheia de graça ("Look, what a lovely thing, so full of grace" in Portuguese) -> [30098, 3099, 8358, 763, 9160, 285, 15152, 300, 22261, 304, 1125, 544, 390, 7933, 50041] (tokens with up to 3 characters)

bonus: Apple => 4196, apple => 17180


It is very interesting to compare how various languages (including programming languages) are tokenized.


‘1984 is 1 token. 1884 is 2 tokens.’

I would be surprised if they use this tokenization still as it’s not math friendly.


They do use this tokenization, and that's the reason why these models sometimes struggle with tasks like "how many twos does this long number contain" and things like "is 50100 greater than 50200" as it tries to compare "501"/"00" with "50"/"200" while knowing that "501" is greater than "50".

The models aren't optimized to be math friendly. They could be, but the major big generic ones weren't.
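
You can see both behaviours side by side (sketch; "gpt2" is the old GPT-3-era encoding, cl100k_base the newer one that chunks digits into groups of up to three):

    import tiktoken

    old, new = tiktoken.get_encoding("gpt2"), tiktoken.get_encoding("cl100k_base")
    for s in ["1984", "1884", "50100", "50200"]:
        print(s, "gpt2 tokens:", len(old.encode(s)), "cl100k_base tokens:", len(new.encode(s)))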


It's a language model, not a mathematical model


This works really poorly in non-latin scripts. Try pasting "Україна" (Ukraine) or "北京是中国的首都" (Beijing is the capital of China). I'm a little surprised that nobody optimized that, there must be enough training data to warrant this effort.


NOTE: this is only valid for the old models (GPT-3 and Codex). IIRC, there is no simple way to know the token usage for the new models (gpt3.5-turbo and beyond).


This guide explains how to count tokens for 3.5-turbo and beyond: https://github.com/openai/openai-cookbook/blob/main/examples...
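
A condensed, hedged version of what that guide does for the chat format (the per-message constants are approximations for gpt-3.5-turbo-0301 and change between model versions):

    import tiktoken

    def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
        enc = tiktoken.encoding_for_model(model)
        tokens_per_message = 4           # rough overhead for the <|im_start|>role ... <|im_end|> wrapping
        total = 0
        for msg in messages:
            total += tokens_per_message
            for value in msg.values():
                total += len(enc.encode(value))
        total += 3                       # the reply is primed with the assistant role
        return total

    print(num_tokens_from_messages([{"role": "user", "content": "Hello!"}]))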



The API does tell you how many prompt and completion tokens each request used, if you're okay knowing after-the-fact


Assuming the API docs are honest, all the publicly available GPTs use the same tokens.



"Hi" 1 token " " 10 tokens

Interesting way to see how much data is actually in an emoji, especially ones that are combinations of separate emojis


That's supposed to be a person with a skin color selection.


Dammit they copied me lol https://www.gptcalculator.xyz/


mine has an API though which hopefully is useful


Also, their solution works. Yours just says "Loading..." whenever I try it.


Is working for me


Would be great if the ChatGPT interface showed a token count instead of throwing an error after submission.


rawdownloadcloneembedreportprint

Tokens 1 Characters 32

Weird...


"not_a_word" token


It’s made so people don’t game the system. Say “hello world” would be the same as “hello_world” for the LLM. If they didn’t count tokens this way, I would be using it for free.



