Could the effect you see, when you spell it out, be not a result of "seeing" tokens, but of the model having learned, at a higher level, how lists in text can be summarized, summed up, filtered, and counted?
In other words, what makes you think it's specifically letter-tokens that help it, and not the high-level concept of spelling things out itself?
It's more that it's liable to struggle to guess how to spell tokens [10295, 947] (or whatever it is), since there's no a priori reason for it to learn to associate them with exactly the right tokens for the individual letters, in the right order. If it's trained on bytes, though, it doesn't need to infer that. It's like asking a smart, semi-literate person a spelling question: they might have a rough sense of it, but they won't be very good at it.
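To make the distinction concrete, here's a minimal sketch using the tiktoken library (my choice of library and encoding, not something from the thread; the exact IDs and splits differ by tokenizer). A subword-tokenized model only ever sees the integer IDs, while a byte-level model sees each letter as its own symbol:

```python
# Minimal sketch (assumes tiktoken is installed; cl100k_base is just an
# example encoding, and the exact IDs and splits differ by tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "strawberries"

# Subword view: the model receives a short list of opaque integer IDs.
# Nothing about these numbers exposes the letters they stand for.
token_ids = enc.encode(word)
print(token_ids)

# Which letters each ID corresponds to is something the model has to
# memorize per token from training data.
print([enc.decode([t]) for t in token_ids])

# Byte-level view: every letter is its own symbol (e.g. 's' = 115,
# 't' = 116), so spelling or counting letters needs no memorized
# token-to-letters mapping.
print(list(word.encode("utf-8")))
```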
Once it's just counting lists, then it's probably drawing on a higher-level capability, yeah.