The way this is being described is almost like a maze-traversal algorithm, where compute time is "how far I'm willing to go down a path to test whether it's a possible solution." I wonder what other parallels we might find. For instance, are some maze-solving algorithms applicable to LLMs?
Sampling sequentially to find the sequence with the highest joint probability is definitely a search problem. That's why you see algorithms like beam search often used for decoding.
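For the curious, here's a minimal beam-search sketch. `next_token_logprobs` is a hypothetical stand-in for a model call that returns per-token log-probabilities (not any particular library's API), and things like batching and length normalization are omitted:

```python
import heapq

def beam_search(next_token_logprobs, start, eos, beam_width=4, max_len=20):
    # Each beam is (cumulative_log_prob, token_sequence).
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == eos:
                # Finished beams carry over unchanged.
                candidates.append((logp, seq))
                continue
            # Expand each unfinished beam by every candidate next token.
            for tok, tok_logp in next_token_logprobs(seq).items():
                candidates.append((logp + tok_logp, seq + [tok]))
        # Keep the beam_width sequences with the highest joint log-probability.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda b: b[0])
```

Greedy decoding is just the beam_width=1 case; widening the beam is exactly the "how far down a path am I willing to go" trade-off from the maze analogy.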
Yes, that's right. It seems like an active area of research.
Honestly, it runs counter to the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), which stems in part from researchers getting too fancy with hand-crafted search in chess. But at the scale LLMs are at right now, the improvements might be worth it.
Hi, contributor to Entropix here. This is just my opinion, but I don't think it runs counter to the Bitter Lesson at all, because it's meant to leverage the model's own computation. Several papers have suggested that models internally compute certainty (https://arxiv.org/abs/2406.16254), and in my view our method simply takes that computation and factors it explicitly into decoding.
This is as opposed to pure sampling on next-token predictions, which essentially picks a token at random from the distribution. So if a model is asked to compute 1274 x 8275 and isn't very sure of the answer, it still confidently emits one, even though it's uncertain and needs to show more working.
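To make that concrete, here's a rough sketch of entropy-aware token selection, the general idea rather than Entropix's actual implementation; the `entropy_threshold` value and the choice of fallback behavior are illustrative assumptions:

```python
import numpy as np

def entropy_of(logits):
    # Softmax with the max subtracted for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Shannon entropy in nats; high entropy means the model is uncertain.
    return -np.sum(probs * np.log(probs + 1e-12))

def pick_token(logits, entropy_threshold=2.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if entropy_of(logits) < entropy_threshold:
        # Model is confident: commit to the argmax token.
        return int(np.argmax(logits))
    # Model is uncertain: a decoder could branch, inject a "thinking" step,
    # or raise temperature to explore; this sketch just samples.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

The point is that the distribution over next tokens already encodes the model's uncertainty, so the decoder can react to it instead of throwing that signal away.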