
But it has to emit hundreds of tokens per test. Does that mean it takes hundreds of times longer to train? Or even longer, since I imagine the feedback loop can cause huge instabilities in the gradients? Or are all GPTs trained on longer outputs now, i.e. is "next word prediction" just a basic thing from the beginning of the transformer era?



Takes a long time, yes, but not longer than pretraining. Sparse rewards are a common issue in RL and are addressed by many techniques (I'm not an expert, so I can't say more). The model still only does next word prediction: it generates a number of trajectories, and the correct ones get rewarded (the predictions in a correct trajectory have their gradients propagated back and reinforced).
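
To make that concrete, here's a minimal REINFORCE-style sketch (PyTorch, with a toy bigram policy standing in for the GPT; the target sequence and hyperparameters are made up for illustration): sample trajectories token by token, reward the ones that are correct, and push up the log-probability of every token in a rewarded trajectory.

    import torch

    vocab_size, seq_len, n_traj = 5, 4, 64
    target = torch.tensor([1, 3, 2, 0])               # hypothetical "right answer"

    # Toy policy: next-token logits conditioned only on the previous token.
    logits_table = torch.randn(vocab_size, vocab_size, requires_grad=True)
    optimizer = torch.optim.Adam([logits_table], lr=0.1)

    for step in range(200):
        prev = torch.zeros(n_traj, dtype=torch.long)  # BOS token = 0
        log_probs, tokens = [], []
        for _ in range(seq_len):
            dist = torch.distributions.Categorical(logits=logits_table[prev])
            tok = dist.sample()                       # plain next-word sampling
            log_probs.append(dist.log_prob(tok))
            tokens.append(tok)
            prev = tok
        tokens = torch.stack(tokens, dim=1)           # (n_traj, seq_len)
        log_probs = torch.stack(log_probs, dim=1)

        # Sparse reward: 1 only if the whole trajectory matches the target.
        # With a realistic vocab and sequence length this almost never fires
        # at first, which is exactly the sparse-reward problem above.
        reward = (tokens == target).all(dim=1).float()

        # REINFORCE: reinforce every token of a rewarded trajectory.
        loss = -(reward.unsqueeze(1) * log_probs).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()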


Good point, I hadn't considered that all RL models have the same challenge. So far I've only tinkered with next token prediction and image classification. Now I'm curious to dig more into RL and see how they scale it. Especially without a human in the loop, it seems like a challenge to grade the output: it's all wrong, wrong, wrong random tokens until the model magically guesses the right answer once in a zillion years.
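
(For verifiable tasks like math I guess the grader can just be a program rather than a human — something like this made-up checker that extracts the final number and compares it to a known answer:)

    import re

    def answer_reward(generated_text: str, ground_truth: float) -> float:
        """1.0 if the last number in the output matches the known answer, else 0.0."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", generated_text)
        if not numbers:
            return 0.0                                 # no answer at all
        return 1.0 if abs(float(numbers[-1]) - ground_truth) < 1e-6 else 0.0

    print(answer_reward("... so the total is 42", 42.0))    # 1.0
    print(answer_reward("wrong wrong wrong tokens", 42.0))  # 0.0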





