I sort of flipped between “boring” to “interesting” to “maybe boring?” to “possibly interesting?” while reading this.
The meaning of the title is simply that models from most providers can do some self-reflection when prompted to do so, without any R1-Zero type fine tuning.
This is presented as surprising, and I do not think it is surprising at all. We’ve known about Chain of Thought-style prompting for some time, and this is a variant of it.
They then mention that 1.5B-ish parameter models don’t seem to be very good at self reflection out of the box. This does not surprise me in the slightest. Unless heavily distilled and tuned for a very specific job, 1.5B parameter models aren’t good at much.
They then note that something about the reward functions in R1-Zero’s setup creates a typical pattern of shorter and shorter self-reflection until some sort of inflection point, after which the reflection gets longer and correct answers become more common.
This seems pretty interesting! The so-called “Aha” moment is when a model during training hits this inflection point and starts productively using and extending the self-reflection.
I think my reaction overall is that the research is worth doing, as it’s trying to get at what exactly works about R1-Zero training, and why it works, and that’s great. It’s just a small start though.
The essence of the article is that self-correction already exists as a nascent ability in base models (more robustly in some, like Qwen, than in others). This is highly reminiscent of Chain of Thought, which was also found to be a capability already present in base models. The result of RL is to reinforce already-present authentic self-correction patterns and down-weight superficial self-correction.
Thoughts:
- An analogy you shouldn't zoom in on too closely: going from CoT to reasoning traces is like going from purely ballistic trajectories to trajectories with navigation and thrusters. RL is for learning how to use the thrusters for adjustments, based on the model's internal encodings of rare samples† where some author fully spelled out their thought process.
- This might also explain why SFT on reasoner traces seems to be surprisingly effective. If it were a purely RL-mediated phenomenon, SFT for reasoning would not work nearly as well.
- DeepSeek struggled to get RL to work on smaller models. If this is replicated, it might be because larger models encode self-correction patterns more robustly and assign them higher probability to begin with.
- For smaller models, imitating traces is easier than pure RL for bringing such patterns to the fore. However, we still want models to learn how to dynamically adjust their thrusters, and SFT does not provide ample opportunity for this. Further training with RL, or alternatively replacing SFT with methods like [Critique Fine-Tuning](https://arxiv.org/abs/2501.17703), is needed.
- The article incidentally reinforces that a low temperature means consistency, not correctness. Except in high-confidence scenarios, the greedily decoded highest-probability answer is generally less likely to be among the best ones the model can give (see the sketch after this list).
†Question: my first thought is blogs by people who discuss what didn't work. But I wonder how much of reasoning-model patterns and ability is shaped by Detective Conan transcripts?
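To illustrate the temperature point, here is a minimal sketch with made-up logits (not from any real model): greedy decoding is perfectly consistent even when the model spreads almost as much probability mass across several alternatives, any of which might lead to a better completion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up next-token logits over five candidate tokens (illustrative only).
logits = np.array([2.0, 1.8, 1.7, 0.5, 0.1])

def decode(logits, temperature):
    """Greedy decoding when temperature == 0, softmax sampling otherwise."""
    if temperature == 0:
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Greedy always picks token 0: consistent, but the model puts nearly as
# much mass on tokens 1 and 2.
print([decode(logits, 0.0) for _ in range(5)])   # -> [0, 0, 0, 0, 0]
print([decode(logits, 1.0) for _ in range(5)])   # -> mostly a mix of 0, 1, 2
```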
>We found Superficial Self-Reflection (SSR) from base models’ responses, in which case self-reflections do not necessarily lead to correct final answers.
I must be missing something here. No one was arguing that the AI answers are correct to begin with, just that self-reflection leads to more correct answers compared to not using the process?
Base models exhibit what the authors call "Superficial Self-Reflection", where it looks like the model is reasoning but the reflection doesn't lead to an actual improvement in answer quality.
Then with RL the models learn to effectively use this reflection to improve answer quality.
The whole read is interesting but I don't think the title is really an accurate description of it…
Agreed - it seems this is to be expected. There is bound to naturally be some "reasoning" / self-reflection / self-correction data in the base model training set. For the most part text on the web is going to be the end result of reasoning rather than the process itself (with reflection, correction), but there is bound to be some "but on second thoughts ...", and "that didn't work out, so ..." etc.
I've never seen it myself, but I've heard that Sonnet 3.5 occasionally self-corrects, although nominally not a "reasoning" model (OTOH, Anthropic don't like the "reasoning" label, and prefer to refer to a continuum of abilities). Presumably this is just a reflection of some such data in the training set.
Of course, as with anything else, the LLM is just predicting based on patterns it saw during training, so any "self-reflection" it generates isn't the model itself reflecting, and there is no reason other than luck and prompt vs training set similarity to expect that this will be valid/useful reflection/reasoning.
Where RL comes in is encouraging goal-directed behavior (predictions), so that individual reasoning steps and the process as a whole (the entire CoT) are more likely to be coherent and result in a valid CoT; it seems this will often emphasize reflection and self-correction where needed to keep the CoT on track.
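For concreteness, here is a toy sketch of the kind of rule-based reward such RL setups optimize (R1-Zero-style training combines a format rule and an accuracy rule; the tag names, regexes, and weights below are my own illustrative assumptions, not the actual values from any paper):

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based reward in the R1-Zero spirit: a format rule plus an
    accuracy rule. Tag names, regexes, and weights are illustrative
    assumptions."""
    reward = 0.0

    # Format rule: reasoning wrapped in <think> tags, final answer in <answer> tags.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL):
        reward += 0.1

    # Accuracy rule: the extracted final answer must match the reference exactly.
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer and answer.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>2+2 is 4</think> <answer>4</answer>", "4"))  # 1.1
```

Nothing here rewards self-reflection directly; any reflection or response-length dynamics would emerge only insofar as the model finds that reflecting helps it satisfy the accuracy rule.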
"...found that the increasing response length phenomenon is not due to the emergence of self-reflection, but a consequence of RL optimizing well-designed rule-based reward functions."