> In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards.
Is there actually evidence of working backwards? From a next-token point of view, predicting the token after "An" is going to heavily favor a vowel-initial word, and predicting the token after "A" is going to heavily favor a word that doesn't start with a vowel.
Firstly, there is behavioral evidence. This is, to me, the less compelling kind, but it's important to understand. You are of course correct that, once Claude has said "An", it will be inclined to follow with something starting with a vowel. But the real mystery is why, in setups like these, Claude is much more likely to say "An" than "A" in the first place. Regardless of what the underlying mechanism is -- and you could imagine ways in which it might just "pattern match" without planning here -- "An" is preferred precisely because, in situations like this, you need to say "An" so that "astronomer" can follow.
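One way to see why the first-token choice already encodes something about the planned continuation: an optimal next-token predictor's probability for "An" is the marginal over all completions that begin with "An". A minimal sketch, using entirely made-up probabilities for illustration:

```python
# Hypothetical distribution over two-word completions in a context where
# an astronomer is the most likely subject. The numbers are invented for
# illustration, not measured from any model.
completions = {
    ("An", "astronomer"): 0.6,
    ("A", "telescope"): 0.3,
    ("A", "star"): 0.1,
}

def first_token_marginal(completions):
    """P(first token) = sum of P(completion) over completions starting with it."""
    marginal = {}
    for (first, _rest), p in completions.items():
        marginal[first] = marginal.get(first, 0.0) + p
    return marginal

marginal = first_token_marginal(completions)
# "An" ends up with 0.6 and "A" with 0.3 + 0.1: the preference for "An"
# exists only because "astronomer" dominates the planned continuations.
```

The point of the sketch is that a predictor matching this marginal behaves *as if* it knows "astronomer" is coming, whether or not it represents that plan explicitly -- which is why the behavioral evidence alone is suggestive but not conclusive.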
But now we also have mechanistic evidence. If you make an attribution graph, you can literally see an "astronomer" feature fire and cause the model to say "An".