If it knows it needs to backtrack, could it gain much by outputting something that tells the orchestrating code to backtrack for it? For example, outputting something like "I've disproven the previous hypothesis, remove the details". Almost like asking to forget.
Pruning those tokens from the context would reduce the number of tokens attended to at inference time, saving compute. But given how attention works, it may not make any difference to the quality of the LLM's output.
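A minimal sketch of what the scaffolding side might look like, assuming a hypothetical `[FORGET]` marker the model is prompted to emit and a `Message` dict format like the usual chat-style `{"role": ..., "content": ...}`:

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed chat format: {"role": "user" | "assistant", "content": "..."}

FORGET_MARKER = "[FORGET]"  # hypothetical control marker the model is prompted to emit


def prune_on_forget(messages: List[Message], reply: str) -> List[Message]:
    """Append the model's reply, but if it asked to forget, first drop the
    dead-end reasoning (here: every message after the last user turn) so the
    next generation call sees a smaller context."""
    if FORGET_MARKER not in reply:
        return messages + [{"role": "assistant", "content": reply}]

    # Keep everything up to and including the last user message.
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    kept = messages[: last_user + 1]

    # Keep only the conclusion, stripped of the marker, not the discarded hypothesis.
    conclusion = reply.replace(FORGET_MARKER, "").strip()
    return kept + [{"role": "assistant", "content": conclusion}]
```

The orchestrator would call `prune_on_forget` after every generation and pass the pruned list back into the model, so the forgotten tokens never re-enter the context window.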
Similarly, could there be gains from the LLM asking to work in parallel? For example: "there are 3 possible approaches to this, clone the conversation so far, explore each one, and keep the branch that ends with the highest confidence".
This feels like it would be fairly trivial to implement.
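As a rough illustration of how trivial the scaffolding could be, here is a sketch assuming a hypothetical `[BRANCH: N]` marker in the model's output, and a `generate_scored` callable (not a real API) that returns a continuation along with some confidence score such as mean token log-probability:

```python
import copy
import re
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
BRANCH_RE = re.compile(r"\[BRANCH:\s*(\d+)\]")  # hypothetical marker, e.g. "[BRANCH: 3]"


def expand_branches(
    generate_scored: Callable[[List[Message]], Tuple[str, float]],
    messages: List[Message],
    reply: str,
) -> List[Message]:
    """If the reply asks to branch, clone the conversation N times, let each
    clone pursue one approach, and keep the continuation with the highest
    confidence score."""
    base = messages + [{"role": "assistant", "content": reply}]
    match = BRANCH_RE.search(reply)
    if not match:
        return base

    n = int(match.group(1))
    best_score, best_branch = float("-inf"), base
    for i in range(n):
        branch = copy.deepcopy(base)
        branch.append({"role": "user", "content": f"Pursue approach {i + 1} only."})
        continuation, score = generate_scored(branch)  # could run these calls in parallel
        if score > best_score:
            best_score = score
            best_branch = branch + [{"role": "assistant", "content": continuation}]
    return best_branch
```

Since each branch is an independent generation call, the clones could be dispatched concurrently; only the winning branch is folded back into the conversation.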