Maybe 1 is actually what you just suggested - an RL approach to select the strategy for 2. Thank you for implementing optillm and working out all the various strategy options, it's a really neat reference for understanding this space.
One item I'm very curious about is how they get a score for use in the RL. In well-defined games it's easy to understand, but in this LLM-output context how does one rate the result for use in an RL setup?
That's the hardest part, figuring out the reward. For generic tasks it is not easy; in my implementation in optillm I am using the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a well-defined reward, say for a coding or logic problem. Maybe they trained a better reward model.
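For what it's worth, a minimal sketch of what "using the LLM itself to score the trajectory" can look like - the client, model name, and prompt here are my assumptions for illustration, not optillm's actual code:

    # Hypothetical LLM-as-judge reward, assuming an OpenAI-compatible client.
    # Not optillm's implementation; just the general shape of the idea.
    from openai import OpenAI

    client = OpenAI()

    def score_trajectory(question: str, trajectory: list[str]) -> float:
        """Ask the LLM to rate an MCTS trajectory 0-10, normalize to [0, 1]."""
        prompt = (
            f"Question: {question}\n\n"
            "Candidate reasoning steps:\n" + "\n".join(trajectory) + "\n\n"
            "Rate the quality of this answer from 0 to 10. "
            "Reply with a single number."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any chat model works
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        try:
            return float(resp.choices[0].message.content.strip()) / 10.0
        except ValueError:
            return 0.0  # unparseable rating -> treat as low reward

The weakness is exactly what's described above: the judge is only as calibrated as the model itself, whereas a unit test or a logic checker gives you a ground-truth signal for free.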