Why do you think it's this specific variant (RingAttention)? There are so many different variants for this.
As far as I know, the problem in most cases is that while the context length might be high in theory, the actual ability to use it is still limited. E.g. recurrent networks have infinite context in theory, but in practice they only use about 10-20 frames of it as context (longer only in very specific settings, or maybe if you scale them up).
There are ways to test a neural network's ability to recall from a very long sequence. For example, insert a random sentence like "X is Sam Altman" somewhere in the text and check whether the model can answer "Who is X?", or, somewhat more indirectly, the same question asked in another language, "Which sentence was inserted out of context?", or "Which celebrity was mentioned in the text?"
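In case it's unclear what I mean, here's a rough sketch of such a test in Python (query_model is just a hypothetical placeholder for whatever model/API call you'd actually use; the filler text and the "who is X" question are made up for illustration):

```python
import random

def build_haystack(filler: list[str], needle: str, seed: int = 0) -> str:
    """Hide the needle sentence at a random position in the filler text."""
    rng = random.Random(seed)
    sentences = filler[:]
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def grade(answer: str, expected: str) -> bool:
    """Count the test as passed if the answer contains the fact from the needle."""
    return expected.lower() in answer.lower()

if __name__ == "__main__":
    filler = ["The ferry to the mainland runs twice a day in winter."] * 2000
    needle = "X is Sam Altman."
    prompt = (
        build_haystack(filler, needle)
        + "\n\nQuestion: Who is X? Answer with the name only."
    )
    # query_model() stands in for whatever inference call you have; not shown here.
    # answer = query_model(prompt)
    # print("recall ok" if grade(answer, "Sam Altman") else "recall failed")
    print(f"prompt is roughly {len(prompt.split())} words long")
```

You can then sweep the needle position and the haystack length to see at what depth and context size recall starts to break down.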
Anyway, the ability to generalize to longer context lengths is evidenced by such tests. If every token of the model's output can answer questions in a way that takes any sentence from the input into account, that gives evidence that the full context window indeed matters. Currently I find Claude 2 to perform very well on such tasks, so that sets my expectation of what a language model with an extremely long context window should look like.
I don't know how either, but maybe: https://news.ycombinator.com/item?id=39367141
Anyway, I mean, there is plenty of public research on this, so it's probably just a matter of time for everyone else to catch up.