I sort of flipped between “boring” and “interesting” and “maybe boring?” and “possibly interesting?” while reading this.
The meaning of the title is simply that models from most providers can do some self-reflection when prompted to do so, without any R1-Zero-style fine-tuning.
This is presented as surprising, and I do not think it is surprising at all. We’ve known about Chain-of-Thought prompting for some time, and this is a variant of it.
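To be concrete about what I mean by “a variant of it”: here is a minimal sketch of that kind of prompt. The wording and the `ask_model` helper are my own stand-ins, not anything from the paper:

```python
# Sketch of eliciting self-reflection by prompting alone, with no
# fine-tuning involved. `ask_model` is a hypothetical stand-in for
# whatever completion API you happen to be using.

REFLECT_TEMPLATE = (
    "Solve the following problem step by step.\n"
    "After your solution, re-read your reasoning, check each step "
    "for errors, and correct your answer if needed.\n\n"
    "Problem: {problem}"
)

def ask_with_reflection(ask_model, problem: str) -> str:
    """One-shot self-reflection: a Chain-of-Thought prompt plus an
    explicit instruction to verify and revise the answer."""
    return ask_model(REFLECT_TEMPLATE.format(problem=problem))
```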
They then mention that 1.5B-ish-parameter models don’t seem to be very good at self-reflection out of the box. This does not surprise me in the slightest: unless heavily distilled and tuned for a very specific job, 1.5B-parameter models aren’t good at much.
They then note that something about the reward functions in R1-Zero’s setup creates a typical pattern of shorter and shorter self-reflection, until some sort of inflection point where the reflection gets longer and correct answers become more common.
This seems pretty interesting! The so-called “Aha” moment is when a model, during training, hits this inflection point and starts productively using and extending its self-reflection.
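For context, my understanding (from the DeepSeek-R1 paper) is that the R1-Zero rewards are simple rule-based checks, roughly like the sketch below. The `<think>` tag convention follows that paper; the weights and the answer-matching details are my own guesses:

```python
import re

# Rough sketch of R1-Zero-style rule-based rewards: one term for a
# correctly formatted <think>...</think> block, one for answer
# correctness. The equal weighting and the plain string comparison
# for the answer are simplifications of mine.

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the model wrapped its reasoning in <think> tags."""
    return 1.0 if THINK_RE.search(completion) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the text left after stripping the think block matches
    the reference answer."""
    answer = THINK_RE.sub("", completion).strip()
    return 1.0 if answer == gold_answer else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    return format_reward(completion) + accuracy_reward(completion, gold_answer)
```

Notably, nothing in there rewards or penalizes response length, so the shrink-then-grow reflection pattern would be an emergent side effect of optimizing these terms rather than something the reward asks for directly.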
I think my overall reaction is that the research is worth doing, as it’s trying to get at what exactly works about R1-Zero training, and why it works, and that’s great. It’s just a small start, though.