GPT-3 was trained at roughly a 4:1 ratio of training tokens to parameters, and for GPT-4 the ratio was closer to 10:1. Extrapolating that trend, GPT-5 should be around 25:1. The parameter count jumped from 175B to 1.3T between those generations, which would put GPT-5 at roughly 10T parameters and 250T training tokens. There is zero chance OpenAI has a training set of 250T high-quality tokens.
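Here is the back-of-envelope version of that extrapolation, as a rough sketch. The 4:1 and 10:1 ratios and the 1.3T parameter figure are the assumptions stated above, not confirmed numbers.

```python
# Back-of-envelope scaling extrapolation using the assumed ratios above (not confirmed figures).
gpt3_params, gpt3_ratio = 175e9, 4      # ~175B params, ~4 training tokens per parameter (assumed)
gpt4_params, gpt4_ratio = 1.3e12, 10    # ~1.3T params, ~10 tokens per parameter (assumed)

# Continue both trends one more generation: ratio 4 -> 10 -> ~25, params 175B -> 1.3T -> ~10T.
gpt5_ratio = gpt4_ratio * (gpt4_ratio / gpt3_ratio)      # ~25 tokens per parameter
gpt5_params = gpt4_params * (gpt4_params / gpt3_params)  # ~9.7e12, call it 10T
gpt5_tokens = gpt5_params * gpt5_ratio                   # ~2.4e14, i.e. ~250T tokens

print(f"GPT-5 (extrapolated): {gpt5_params:.1e} params, {gpt5_tokens:.1e} tokens")
```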
If I had to guess, they trained a model somewhere in the 3-4T parameter range on 30-50T high-quality tokens, plus maybe another 10-30T of medium- and low-quality ones.
There is only one company in the world that stores the data that could get us past the wall.
The training cost of the scaled-out GPT-5 above is roughly 150x GPT-4 (about 7.7x the parameters times about 19x the tokens), and GPT-4 was trained on 25k A100s for about 90 days, with poor MFU.
Even assuming they doubled MFU, that would still mean something like 1M H100s. Factor in algorithmic improvements on top of that, and maybe it comes down to 250-500k H100s.
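A rough sketch of where those numbers come from. The GPT-4 token count follows from the 10:1 ratio assumed above, and the H100-vs-A100 throughput factor is my own assumption, not a confirmed figure.

```python
# Back-of-envelope compute estimate, reusing the assumed figures above.
# Assumptions (mine, not confirmed): GPT-4 at ~1.3T params on ~13T tokens (the 10:1 ratio),
# trained on 25k A100s for 90 days; an H100 delivers roughly 2x the effective training
# throughput of an A100 at the same MFU.
gpt4_params, gpt4_tokens = 1.3e12, 13e12
gpt5_params, gpt5_tokens = 10e12, 250e12

compute_multiplier = (gpt5_params / gpt4_params) * (gpt5_tokens / gpt4_tokens)  # ~150x

a100_equiv = 25_000 * compute_multiplier   # ~3.7M A100s for the same 90-day run
h100_equiv = a100_equiv / 2                # ~1.85M H100s at the same MFU
h100_with_2x_mfu = h100_equiv / 2          # ~0.9M H100s if MFU doubles, i.e. ~1M

print(f"Compute multiplier: ~{compute_multiplier:.0f}x")
print(f"H100s for a 90-day run with 2x MFU: ~{h100_with_2x_mfu:,.0f}")
```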
But the actual training cluster was around 100k GPUs, later growing to roughly 150k, and a cluster of that size is suggestive of a smaller model and less data.
But ultimately data is the bottleneck.