
You need to be prepared for the reality that naive scaling no longer works for LLMs.

Simple question: where is GPT-5?




It is a possibility, but my understanding of what OpenAI has said is that GPT-5 is delayed because of the apparent promise of RL-trained models like o1, and that they've simply decided to train those instead of training a bigger base model on better data. I think this is plausible.


OpenAI has an incentive to make people believe that the scaling laws are still alive, to justify their enormous capex if nothing else.

I wouldn't give what they say too much credence, and will only believe the results I see.


Yes, I think I agree that it seems unlikely that the spending they're doing can be recouped.

But it can still make sense for a state, even if it doesn't make sense for investors.


If we expect the compute demand for GPT-5 to be 100x that of GPT-4, and GPT-4 was trained over months on 10k H100s, then you would need years with 100k H100s, or perhaps months again with 100k GB200s.

See, there is your answer. The issue is that GPU compute is still way too low for GPT-5 if they continue parameter scaling the way they used to.

GPT-3 took months on 10k A100s; 10k H100s would have done it in a fraction of the time. Blackwell could train GPT-4 in 10 days with the same number of GPUs as Hopper, which took months.

Don't forget GPT-3 is just 2.5 years old. Training is obviously waiting for the next step up in large-cluster training speed. Don't be fooled: the 2x figure for Blackwell vs. Hopper is only chip vs. chip. 10k Blackwells, including all the networking speedup, are easily 10x or more faster than the same number of Hoppers. So building a 1-million-Blackwell cluster means 100x more training compute compared to a 100k Hopper cluster.
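
As a rough back-of-the-envelope sketch of that argument (all the multipliers here are the assumptions stated above, not measured numbers):

    # Back-of-the-envelope cluster comparison (illustrative only).
    # Assumed figures from the comment above: ~10x effective per-GPU gain
    # at cluster scale once networking is included, and a hypothetical
    # 100x jump in compute demand for a "GPT-5"-scale run vs. GPT-4.

    def relative_training_time(compute_multiplier, gpu_count_ratio, per_gpu_speedup):
        """Training time relative to the baseline run.

        compute_multiplier: how much more compute the new model needs (e.g. 100).
        gpu_count_ratio:    new cluster size / old cluster size (e.g. 10).
        per_gpu_speedup:    effective per-GPU speedup incl. networking (e.g. 10).
        """
        return compute_multiplier / (gpu_count_ratio * per_gpu_speedup)

    # GPT-4 baseline: months on 10k Hoppers. A 100x-compute model on 100k H100s:
    print(relative_training_time(100, 10, 1))   # 10.0 -> ~10x longer, i.e. years
    # The same model on 100k GPUs assumed to be ~10x effectively faster (GB200-class):
    print(relative_training_time(100, 10, 10))  # 1.0  -> back to "months"
    # A 1M next-gen cluster vs. a 100k Hopper cluster, under the same assumption:
    print(1_000_000 / 100_000 * 10)             # 100.0 -> the claimed 100x compute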

Nobody starts a model training run if it takes years to finish... too much risk in that.

The Transformer was introduced in 2017 and ChatGPT came out in 2022. Why? Because they would have needed millions of Volta GPUs instead of thousands of Ampere GPUs to train it.


There is a theory, based on DeepSeek's distillation process, hinting that o1 is really a distillation of a bigger GPT (GPT-5?).

Some consider this to be spurious/conspiracy.
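
For anyone unfamiliar with the term: "distillation" here would mean the standard knowledge-distillation setup, where a smaller student model is trained to match a larger teacher's output distribution. A minimal sketch of that generic recipe (the function, temperature, and weighting are illustrative, not anything DeepSeek or OpenAI have disclosed):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Standard knowledge-distillation loss (Hinton et al., 2015).

        The student is pushed toward the teacher's softened output
        distribution (KL term) while still fitting the ground-truth
        labels (cross-entropy term).
        """
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        log_student = F.log_softmax(student_logits / T, dim=-1)
        kl = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kl + (1 - alpha) * ce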


There is a big model from NVIDIA that I assume is for this purpose, i.e. Megatron 530b, so it doesn't sound too unreasonable.

Edit: I assumed that the model was a distillation; that is apparently not true.



