
But it has to emit hundreds of tokens per test. Does that mean it takes hundreds of times longer to train? Or even longer, since I imagine the feedback loop can cause huge instabilities in the gradients? Or are all GPTs trained on longer outputs now, i.e. is "next word prediction" just a basic thing from the beginning of the transformer era?



Takes a long time, yes, but not longer than pretraining. Sparse rewards are a common issue in RL and are addressed by many techniques (I'm not an expert, so I can't say more). The model still only does next word prediction: it generates a number of trajectories, and the correct ones get rewarded (the predictions in a correct trajectory have their gradients propagated back and reinforced).
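
To make that concrete, here's a minimal REINFORCE-style sketch (PyTorch, with a toy bigram policy standing in for the GPT; the target sequence and hyperparameters are made up for illustration): sample trajectories token by token, reward the ones that are correct, and push up the log-probability of every token in a rewarded trajectory.

    import torch

    vocab_size, seq_len, n_traj = 5, 4, 64
    target = torch.tensor([1, 3, 2, 0])               # hypothetical "right answer"

    # Toy policy: next-token logits conditioned only on the previous token.
    logits_table = torch.randn(vocab_size, vocab_size, requires_grad=True)
    optimizer = torch.optim.Adam([logits_table], lr=0.1)

    for step in range(200):
        prev = torch.zeros(n_traj, dtype=torch.long)  # BOS token = 0
        log_probs, tokens = [], []
        for _ in range(seq_len):
            dist = torch.distributions.Categorical(logits=logits_table[prev])
            tok = dist.sample()                       # plain next-word sampling
            log_probs.append(dist.log_prob(tok))
            tokens.append(tok)
            prev = tok
        tokens = torch.stack(tokens, dim=1)           # (n_traj, seq_len)
        log_probs = torch.stack(log_probs, dim=1)

        # Sparse reward: 1 only if the whole trajectory matches the target.
        # With a realistic vocab and sequence length this almost never fires
        # at first, which is exactly the sparse-reward problem above.
        reward = (tokens == target).all(dim=1).float()

        # REINFORCE: reinforce every token of a rewarded trajectory.
        loss = -(reward.unsqueeze(1) * log_probs).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()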


Good point, I hadn't considered that all RL models have the same challenge. So far I've only tinkered with next token prediction and image classification. Now I'm curious to dig more into RL and see how they scale it. Especially without a human in the loop, it seems like a challenge to grade the output: it's all wrong, wrong, wrong random tokens until the model magically guesses the right answer once in a zillion years.
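
(For verifiable tasks like math I guess the grader can just be a program rather than a human — something like this made-up checker that extracts the final number and compares it to a known answer:)

    import re

    def answer_reward(generated_text: str, ground_truth: float) -> float:
        """1.0 if the last number in the output matches the known answer, else 0.0."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", generated_text)
        if not numbers:
            return 0.0                                 # no answer at all
        return 1.0 if abs(float(numbers[-1]) - ground_truth) < 1e-6 else 0.0

    print(answer_reward("... so the total is 42", 42.0))    # 1.0
    print(answer_reward("wrong wrong wrong tokens", 42.0))  # 0.0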





