
Why so adamant that models work on ‘words’?

ChatGPT 3.5/4 tokens:

   “Elephant”: 46439, 28022 - “Ele” “phant”
   “elephant”: 10274, 28022 - “ele” “phant”
   “ Elephant”: 79189
   “ elephant”: 46840
   “ elephantine”: 46840, 483 - “ elephant” “ine”
Tokens are tokens. If it were limited to words it wouldn’t be able to produce non-words, but GPT and other LLMs are quite capable of inventing words, outputting nonsense words, and modifying words.
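
You can check token splits like these yourself; here’s a minimal sketch assuming OpenAI’s tiktoken library and the cl100k_base encoding (IDs vary by encoding, so they may not match the ones above exactly):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4 chat models

    for text in ["Elephant", "elephant", " Elephant", " elephant", " elephantine"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(repr(text), ids, pieces)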

Regarding the ‘no idea which future it is going to follow’ - sure, it doesn’t know which future; the sampler phase is going to pick an output based purely on the probabilities the model emits. But the model outputs higher probabilities for some tokens because they are good tokens for steering toward probable futures. It’s suggesting steps down certain paths because those paths are likely to lead to useful places.
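
To make the sampler phase concrete, here’s a minimal sketch of temperature sampling over next-token logits (the function name and shapes are illustrative, not any particular library’s API):

    import numpy as np

    def sample_next_token(logits, temperature=1.0):
        # Scale the raw scores, then softmax into a probability distribution
        z = np.asarray(logits) / temperature
        z = z - z.max()  # subtract max for numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        # The sampler just draws from the distribution; any "lookahead"
        # lives in how the model shaped these probabilities, not here.
        return np.random.choice(len(probs), p=probs)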




I didn't say WORK on words, I said OUTPUT words.

But, it doesn't make any difference whether you are considering tokens or words. There are multiple possible continuations of the prompt, and the next word (or token) output does not - in general - force the word (or token) after that ...

Your "large grey mammal" could be an "elected official in a grey suit".


Right, it’s possible, but when the LLM places a high probability on the “ele” token it’s not because it predicts “elected official” is a likely continuation. It’s because it’s thinking about elephants.

Likewise, when a coding LLM starts outputting a for-each loop, it’s doing so because it expects to want to write some code that operates on each item in a list. I don’t see how you can explain that behavior without concluding that it must be generating some sort of high-level algorithmic plan that makes it feel like the next thing it should output is some sort of ‘foreach’ token.


I'm not disagreeing with what is presumably happening, but rather on how to characterize that.

Of course, next-word predictions are not based directly on surface-level word-sequence patterns - they are based on internal representations of what those word sequences mean, and predicted continuations are presumably at a similar level of abstraction/representation (what you are calling a plan). This continuation "plan" then drives actual word selection/prediction.

Where we seem to differ is whether this high level continuation representation can really be considered as a "plan". To me the continuation is just a prediction, as are the words that might be used to start expressing that continuation, and presumably it's not even a single continuation with multiple ways of expressing it (turning it into a word sequence), but rather some superposition of multiple alternate continuations.

At the level of words output it becomes even less plan-like, since the actual word output is randomly sampled, and when fed back in as part of the "sentence so far" it may cause the model to predict a different continuation (or set of continuations) than it had at the prior step. So any "plan" (aka predicted continuation) is potentially changing continuously from word to word, rather than being decided ahead of time and then executed. As I noted elsewhere in this thread, the inability to plan multiple words ahead is behind these models' generally poor performance on the "give me a sentence ending in <word>" task, as opposed to their perfect performance on the "give me a sentence starting with <word>" one.
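
That feedback loop is just standard autoregressive decoding; here's a sketch, reusing sample_next_token from the earlier comment (model is a hypothetical function mapping a token sequence to next-token logits):

    def generate(model, prompt_tokens, n_new, temperature=1.0):
        tokens = list(prompt_tokens)
        for _ in range(n_new):
            logits = model(tokens)  # prediction conditions on everything so far
            t = sample_next_token(logits, temperature)
            tokens.append(t)        # the sampled token is fed back in, so the
                                    # predicted continuation can shift at every step
        return tokens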

If we contrast this behavior of a basic LLM with the "tree of thoughts" mechanism that has been proposed, it again highlights how unplan-like the basic behavior is. In the tree of thoughts mechanism the model is sampled from multiple times, generating multiple alternate (multi-word) continuations, which are then evaluated, with the best being chosen. If the model were really planning ahead of time, this should not be necessary - planning would consist of considering the alternatives BEFORE deciding what to generate.
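
A sketch of that contrast, reusing generate from above (score is a hypothetical evaluator, in practice often the model itself prompted to rate each candidate; real tree-of-thoughts implementations branch and backtrack more elaborately):

    def tree_of_thoughts_step(model, tokens, score, k=5, span=20):
        # Sample k alternate multi-token continuations, then evaluate them
        # AFTER generation and keep the best. A basic LLM never makes this
        # comparison: it commits to each token as it goes.
        candidates = [generate(model, tokens, span) for _ in range(k)]
        return max(candidates, key=score)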





