
Thanks for sharing. I had trouble reading the transcript, so here is Claude's cleaned up version and summary:

Here's the condensed and formatted transcription in a single paragraph: This is the last thing I want to highlight in this section on why RL works. Here they evaluate different things - they evaluate specifically pass at K and maj at K. Maj at K is like majority voting, so what you do is you have a model, you have a question, and you output not just one output but an ordered set. So you give your top 20 answers - 0 is your best answer that the model wants to give most, then the second most answer, third most answer, and so on. They could all be correct, just different reformulations of the same answer or different derivations stated in different ways. What you're interested in is how many of the top K results are correct - that's the pass at K. And if you did majority voting on the top K, how often would you be correct then? There's a slight difference, and that slight difference is actually made more drastic by reinforcement learning. They say, "As shown in figure 7, reinforcement learning enhances majority at K performance but not pass at K." These findings indicate that reinforcement learning enhances the model's overall performance by rendering the output distribution more robust. In other words, it seems that the improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities. This is something we've come to learn in many different ways from reinforcement learning on language models or even supervised fine-tuning - what's happening most likely is that the capabilities of doing all of these things are already present in the underlying pre-trained language model.

Summary: Reinforcement learning improves language model performance not by enhancing fundamental capabilities but by making the output distribution more robust, effectively boosting correct responses within the top results rather than improving the model's inherent abilities.




Just don't.

This is a horrible summary. It is both too complex and too simple at the same time. It spends about half its time talking about pass@k while failing to explain what it is, and it offers a great deal of good-sounding but misleading statements, which makes me think Claude completely misunderstood the metric (pass@k is absolutely not like majority voting). Pass@k means you get k attempts to answer a question. Right? You passed. Wrong? Well, you've got k (for example, 10) attempts.
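To make the distinction concrete, here is a minimal sketch in plain Python (my own toy example, not from the paper): pass@k asks whether any of the k sampled answers is correct, while maj@k asks whether the most frequent of the k answers is correct.

    from collections import Counter

    def pass_at_k(samples, correct):
        # pass@k: did any of the k attempts hit the correct answer?
        return correct in samples

    def maj_at_k(samples, correct):
        # maj@k: is the most frequent of the k attempts the correct answer?
        return Counter(samples).most_common(1)[0][0] == correct

    attempts = ["42", "41", "42", "41", "41"]  # k = 5 sampled answers
    print(pass_at_k(attempts, "42"))  # True: at least one attempt was right
    print(maj_at_k(attempts, "42"))   # False: the plurality answer was wrong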

The paper itself is much better. Hell, the conclusion of the paper is so much better than what you have here.

Here's a decent summary, directly from the paper's conclusion:

1. RL-trained models perform worse than base models in pass@k at large k values. (note that Claude's explanation of what pass@k is in the parent post is extremely wrong)

2. RL boosts sampling efficiency but reduces the reasoning capacity boundary. (A toy numeric sketch of points 1 and 2 follows this list.)

3. RLVR algorithms perform similarly and remain far from optimal.

4. RLVR and distillation are fundamentally different.
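To see how points 1 and 2 can both be true at once, here is a toy simulation (invented probabilities, not data from the paper): a base model that spreads probability mass widely keeps improving pass@k as k grows, while an RL-tuned model that has collapsed onto one answer per problem wins pass@1 and maj@k but plateaus at large k.

    import random
    from collections import Counter

    random.seed(0)
    CORRECT = "A"

    # Invented per-problem answer distributions, purely illustrative.
    # Base model: the correct answer is always reachable but never the mode.
    base = [{"A": 0.2, "B": 0.5, "C": 0.3}] * 10
    # RL model: confidently right on 7/10 problems, but the correct answer
    # has been pruned away entirely on the other 3.
    rl = [{"A": 0.9, "B": 0.1}] * 7 + [{"B": 0.9, "C": 0.1}] * 3

    def draw(dist, k):
        answers, weights = zip(*dist.items())
        return random.choices(answers, weights=weights, k=k)

    def score(problems, k, trials=2000):
        # Returns (pass@k, maj@k) averaged over problems and trials.
        p_hits = m_hits = 0
        for _ in range(trials):
            for dist in problems:
                s = draw(dist, k)
                p_hits += CORRECT in s
                m_hits += Counter(s).most_common(1)[0][0] == CORRECT
        n = trials * len(problems)
        return p_hits / n, m_hits / n

    for k in (1, 4, 64):
        bp, bm = score(base, k)
        rp, rm = score(rl, k)
        print(f"k={k:2d}  base: pass@k={bp:.2f} maj@k={bm:.2f}"
              f"   rl: pass@k={rp:.2f} maj@k={rm:.2f}")
    # Typical result: rl wins pass@1 and maj@k at every k, but base
    # overtakes it on pass@k once k is large (~1.00 vs ~0.70 at k=64).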

And here's a one-line summary from me:

This paper claims that RL(VR) training is like improving the model's search algorithm: the model becomes (a lot) better at locating a good answer it already contains, but it is also pushed too hard to give only that answer.

Before Claude makes another absurd claim: RL = reinforcement learning (the reward can be about anything, for example safety - say you try to get the model to explain breaking into a car, and if it ever does, that's bad). RLVR = reinforcement learning with verifiable rewards (the reward only checks whether the final answer is correct; you get to reminisce/think as much as you want before giving that final answer, and the thinking does not have to be relevant).
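As a rough sketch of what "verifiable reward" means in practice (my own illustration; the last-line answer extraction is a simplification I'm assuming, not anything from the paper):

    def rlvr_reward(model_output: str, reference_answer: str) -> float:
        # Everything before the final answer (the "thinking") is ignored;
        # only the verifiable final answer earns reward.
        final_answer = model_output.strip().splitlines()[-1]
        return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

    out = "Let me reminisce about something irrelevant...\nHmm, 6*7...\n42"
    print(rlvr_reward(out, "42"))  # 1.0: only the final line is checked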

And a comment: this is exactly what you'd expect to see from mild overtraining of the model. It could be that the current big players are pushing the models to be right/helpful/safe too hard, and taking away too much "freedom" in the process.


I appreciate the feedback; another reminder not to lean too much on LLMs.


This also seems to be why rejection sampling + SFT is just as good, if not better, in a lot of scenarios.
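For anyone unfamiliar, a sketch of what rejection sampling + SFT means here (my own toy outline; `sample`, `verify`, and `finetune` are hypothetical stand-ins, not a real API): sample many answers per prompt, keep only the ones a verifier accepts, then run ordinary supervised fine-tuning on the kept transcripts instead of a policy-gradient update.

    import random

    def rejection_sample_then_sft(sample, verify, finetune, prompts, n=16):
        # sample(prompt) -> str, verify(prompt, answer) -> bool,
        # finetune(pairs) -> None: all supplied by the caller (hypothetical).
        kept = []
        for prompt in prompts:
            candidates = [sample(prompt) for _ in range(n)]  # draw n answers
            kept += [(prompt, c) for c in candidates if verify(prompt, c)]
        finetune(kept)  # plain SFT on the surviving (prompt, answer) pairs
        return kept

    # Toy stand-ins so the sketch actually runs:
    rejection_sample_then_sft(
        sample=lambda p: random.choice(["41", "42"]),
        verify=lambda p, a: a == "42",
        finetune=lambda pairs: print(f"SFT on {len(pairs)} accepted samples"),
        prompts=["What is 6*7?"],
    )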



