
Thanks for sharing. I had trouble reading the transcript, so here is Claude's cleaned up version and summary:

Here's the condensed and formatted transcription in a single paragraph: This is the last thing I want to highlight in this section on why RL works. Here they evaluate different things - they evaluate specifically pass at K and maj at K. Maj at K is like majority voting, so what you do is you have a model, you have a question, and you output not just one output but an ordered set. So you give your top 20 answers - 0 is your best answer that the model wants to give most, then the second most answer, third most answer, and so on. They could all be correct, just different reformulations of the same answer or different derivations stated in different ways. What you're interested in is how many of the top K results are correct - that's the pass at K. And if you did majority voting on the top K, how often would you be correct then? There's a slight difference, and that slight difference is actually made more drastic by reinforcement learning. They say, "As shown in figure 7, reinforcement learning enhances majority at K performance but not pass at K." These findings indicate that reinforcement learning enhances the model's overall performance by rendering the output distribution more robust. In other words, it seems that the improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities. This is something we've come to learn in many different ways from reinforcement learning on language models or even supervised fine-tuning - what's happening most likely is that the capabilities of doing all of these things are already present in the underlying pre-trained language model.

Summary: Reinforcement learning improves language model performance not by enhancing fundamental capabilities but by making the output distribution more robust, effectively boosting correct responses within the top results rather than improving the model's inherent abilities.




Just don't.

This is a horrible summary. It is both too complex and too simple at the same time. It spends about half its time talking about pass@k while failing to explain what it is, and it offers a great deal of good-sounding but misleading statements, which makes me think Claude completely misunderstood the metric (pass@k is absolutely not like majority voting). Pass@k means you get k attempts to answer a question. Right? You passed. Wrong? Well, you've got k (for example, 10) attempts.
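To make the distinction concrete, here is a minimal sketch in plain Python (my own toy example, not from the paper): pass@k asks whether any of the k sampled answers is correct, while maj@k asks whether the most frequent of the k answers is correct.

    from collections import Counter

    def pass_at_k(samples, correct):
        # pass@k: did any of the k attempts hit the correct answer?
        return correct in samples

    def maj_at_k(samples, correct):
        # maj@k: is the most frequent of the k attempts the correct answer?
        return Counter(samples).most_common(1)[0][0] == correct

    attempts = ["42", "41", "42", "41", "41"]  # k = 5 sampled answers
    print(pass_at_k(attempts, "42"))  # True: at least one attempt was right
    print(maj_at_k(attempts, "42"))   # False: the plurality answer was wrong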

The paper itself is much better. Hell, the conclusion of the paper is so much better than what you have here.

Here's a decent summary, directly from the paper's conclusion:

1. RL-trained models perform worse than base models in pass@k at large k values. (note that Claude's explanation of what pass@k is in the parent post is extremely wrong)

2. RL boosts sampling efficiency but reduces the reasoning capacity boundary. (A toy numeric sketch of points 1 and 2 follows this list.)

3. RLVR algorithms perform similarly and remain far from optimal.

4. RLVR and distillation are fundamentally different.
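To see how points 1 and 2 can both be true at once, here is a toy simulation (invented probabilities, not data from the paper): a base model that spreads probability mass widely keeps improving pass@k as k grows, while an RL-tuned model that has collapsed onto one answer per problem wins pass@1 and maj@k but plateaus at large k.

    import random
    from collections import Counter

    random.seed(0)
    CORRECT = "A"

    # Invented per-problem answer distributions, purely illustrative.
    # Base model: the correct answer is always reachable but never the mode.
    base = [{"A": 0.2, "B": 0.5, "C": 0.3}] * 10
    # RL model: confidently right on 7/10 problems, but the correct answer
    # has been pruned away entirely on the other 3.
    rl = [{"A": 0.9, "B": 0.1}] * 7 + [{"B": 0.9, "C": 0.1}] * 3

    def draw(dist, k):
        answers, weights = zip(*dist.items())
        return random.choices(answers, weights=weights, k=k)

    def score(problems, k, trials=2000):
        # Returns (pass@k, maj@k) averaged over problems and trials.
        p_hits = m_hits = 0
        for _ in range(trials):
            for dist in problems:
                s = draw(dist, k)
                p_hits += CORRECT in s
                m_hits += Counter(s).most_common(1)[0][0] == CORRECT
        n = trials * len(problems)
        return p_hits / n, m_hits / n

    for k in (1, 4, 64):
        bp, bm = score(base, k)
        rp, rm = score(rl, k)
        print(f"k={k:2d}  base: pass@k={bp:.2f} maj@k={bm:.2f}"
              f"   rl: pass@k={rp:.2f} maj@k={rm:.2f}")
    # Typical result: rl wins pass@1 and maj@k at every k, but base
    # overtakes it on pass@k once k is large (~1.00 vs ~0.70 at k=64).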

And here's a one-line summary from me:

This paper claims that RL(VR) training is like improving the model's search algorithm: the model becomes (a lot) better at locating a good answer it already contains, but it is also pushed too hard to give only that answer.

Before Claude makes another absurd claim: RL = reinforcement learning (the reward can be about anything, for example safety - say you try to get the model to explain breaking into a car, and if it ever does, that's bad). RLVR = reinforcement learning with verifiable rewards (the reward only checks whether the final answer is correct; you get to reminisce/think as much as you want before giving that final answer, and the thinking does not have to be relevant).
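As a rough sketch of what "verifiable reward" means in practice (my own illustration; the last-line answer extraction is a simplification I'm assuming, not anything from the paper):

    def rlvr_reward(model_output: str, reference_answer: str) -> float:
        # Everything before the final answer (the "thinking") is ignored;
        # only the verifiable final answer earns reward.
        final_answer = model_output.strip().splitlines()[-1]
        return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

    out = "Let me reminisce about something irrelevant...\nHmm, 6*7...\n42"
    print(rlvr_reward(out, "42"))  # 1.0: only the final line is checked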

And a comment: this is exactly what you'd expect to see from mild overtraining of the model. It could be that the current big players are pushing the models to be right/helpful/safe too hard, and taking away too much "freedom" in the process.


I appreciate the feedback; another reminder not to lean too much on LLMs.


This also seems to be why rejection sampling + SFT is just as good, if not better, in a lot of scenarios.
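For anyone unfamiliar, a sketch of what rejection sampling + SFT means here (my own toy outline; `sample`, `verify`, and `finetune` are hypothetical stand-ins, not a real API): sample many answers per prompt, keep only the ones a verifier accepts, then run ordinary supervised fine-tuning on the kept transcripts instead of a policy-gradient update.

    import random

    def rejection_sample_then_sft(sample, verify, finetune, prompts, n=16):
        # sample(prompt) -> str, verify(prompt, answer) -> bool,
        # finetune(pairs) -> None: all supplied by the caller (hypothetical).
        kept = []
        for prompt in prompts:
            candidates = [sample(prompt) for _ in range(n)]  # draw n answers
            kept += [(prompt, c) for c in candidates if verify(prompt, c)]
        finetune(kept)  # plain SFT on the surviving (prompt, answer) pairs
        return kept

    # Toy stand-ins so the sketch actually runs:
    rejection_sample_then_sft(
        sample=lambda p: random.choice(["41", "42"]),
        verify=lambda p, a: a == "42",
        finetune=lambda pairs: print(f"SFT on {len(pairs)} accepted samples"),
        prompts=["What is 6*7?"],
    )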



