I was having a look at the DeepSeek-R1 technical report and found the "aha moment" claims quite smelly, given that they do not disclose whether the base model's training data contains any chain-of-thought or reasoning data.
However, we know the base model is DeepSeek V3, and the DeepSeek V3 technical report says the following in Section 5.1, Supervised Fine-Tuning:
> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.
In 5.4.1 they also describe an ablation experiment in which they train without the "internal DeepSeek-R1" generated data.
While the "internal DeepSeek-R1" model is not explained, I would assume it is a DeepSeek V2 or V2.5 tuned for chain-of-thought reasoning. Therefore, it seems to me the "aha moment" is just promoting behaviour that was already present in V3.
In the "Self-evolution Process of DeepSeek-R1-Zero"/ Figure 3 they claim reinforcement learning also leads to the model generating longer CoT sequences, but again, this comes from V3, they even mention the fine tuning with "internal R1" led to "excessive length".
None of the blogpost, news, articles I have read explaining or commenting on DeepSeek R1 takes this into account. The community is scrambling to re-implement the pipeline (see open-r1).
At this point, I feel like I took a crazy pill. Am I interpreting this completely wrong? Can someone shed some light on this?
I'm also very skeptical of the significance of this "aha moment". Even if they didn't include chain-of-thought data in the base model's training set (unlikely), there is still plenty of it on the modern Internet. OpenAI released 800k reasoning steps which are publicly available, plus GitHub repositories, examples in CoT papers... It's definitely not a novel concept that the model somehow discovered on its own.