> Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
This paper suggests that a large language model should "think ahead" by predicting not only the next token but also a "supporting thought." The approach generates the thought tokens for all positions simultaneously, so a single forward pass produces both the next token and a supporting thought of, say, 16 tokens.
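A minimal sketch of that idea at a single position, written sequentially for clarity rather than in the paper's parallel form; the GPT-2 checkpoint, the prompt, and the plain logit comparison at the end are illustrative assumptions, not the paper's setup:

```python
# Sketch: sample a short "supporting thought" after a prompt and let it
# condition the next-token prediction, comparing against the no-thought
# prediction. Model and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The answer to 17 * 3 is"
ids = tok(prompt, return_tensors="pt").input_ids

# 1) Sample a 16-token supporting thought continuing the prompt.
with_thought = model.generate(ids, max_new_tokens=16, do_sample=True,
                              pad_token_id=tok.eos_token_id)

# 2) Next-token prediction conditioned on prompt + thought.
logits_with_thought = model(with_thought).logits[:, -1, :]

# 3) Baseline: next-token prediction from the prompt alone.
logits_base = model(ids).logits[:, -1, :]

# Quiet-STaR combines the two predictions with a learned mixing head;
# here we just look at the top token of each.
print(tok.decode(logits_base.argmax(-1)),
      tok.decode(logits_with_thought.argmax(-1)))
```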
This supporting thought influences the model's prediction. The process is then extended to multiple supporting thoughts by ingeniously masking cross-attention between thoughts so that they stay independent of one another. So in essence we can fill all of the remaining context with supporting thoughts and benefit from all of them in the same single forward pass.
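A sketch of what such a mask could look like, assuming a hypothetical layout where all thought tokens are appended after the text; `thought_attention_mask` is an illustrative helper, not the paper's code, and only the thought-independence part of the masking is shown (how later text attends back to the thoughts is omitted):

```python
import torch

def thought_attention_mask(n_text: int, positions: list[int], thought_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend). Text attends causally to text; the
    k-th thought, inserted after text position positions[k], attends to that
    text prefix and to its own earlier tokens, but never to another thought."""
    n = n_text + len(positions) * thought_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_text, :n_text] = torch.ones(n_text, n_text).tril().bool()
    for k, pos in enumerate(positions):
        start = n_text + k * thought_len
        end = start + thought_len
        mask[start:end, : pos + 1] = True                    # thought sees its text prefix
        mask[start:end, start:end] = torch.ones(thought_len, thought_len).tril().bool()
    return mask

# two independent 3-token thoughts: one after text token 1, one after text token 3
print(thought_attention_mask(n_text=5, positions=[1, 3], thought_len=3).int())
```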
The supporting thoughts themselves are trained with RL, with the objective of maximizing the probability of a longer sequence ahead. So they are optimized for the longer term rather than for the myopic next-token prediction task.
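A hedged sketch of such a REINFORCE-style update: each sampled thought is rewarded by how much it raises the log-likelihood of the upcoming ground-truth tokens, using the no-thought likelihood as a simple baseline (the paper's exact baseline differs); all names and numbers below are made up:

```python
import torch

def reinforce_thought_loss(logp_thought, logp_future_with, logp_future_without):
    """Reward = how much the thought improves the log-likelihood of the next
    few ground-truth tokens versus predicting them without the thought.
    Inputs are summed log-probabilities, one scalar per example."""
    reward = (logp_future_with - logp_future_without).detach()
    # maximize the reward-weighted log-prob of the sampled thought tokens
    return -(reward * logp_thought).mean()

# toy example with made-up numbers
logp_thought = torch.tensor([-12.3, -9.8], requires_grad=True)  # sum log p(thought)
logp_with    = torch.tensor([-4.1, -6.0])                       # log p(future | thought)
logp_without = torch.tensor([-5.2, -5.9])                       # log p(future)

loss = reinforce_thought_loss(logp_thought, logp_with, logp_without)
loss.backward()
print(loss.item(), logp_thought.grad)
```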
https://arxiv.org/abs/2403.09629