They also said in the paper that 405B was only trained to roughly the "compute-optimal" point, unlike the smaller models, which were trained well past it. That suggests the larger model still had runway: had they kept training, it would likely have kept getting stronger.
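For a rough sense of what "compute-optimal" means here, a quick back-of-the-envelope sketch using the classic Chinchilla heuristic of ~20 training tokens per parameter and the standard C ≈ 6ND compute estimate. The paper fits its own scaling law, so the exact constants below are assumptions, not the paper's numbers:

```python
# Rough "compute-optimal" arithmetic under the Chinchilla heuristic
# (~20 tokens per parameter); the Llama 3 paper fits its own scaling
# law, so these constants are illustrative assumptions.

def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token budget for a model size."""
    return params * tokens_per_param

def training_flops(params: float, tokens: float) -> float:
    """Standard C ~= 6 * N * D estimate of training compute."""
    return 6.0 * params * tokens

n = 405e9                             # 405B parameters
d_opt = chinchilla_optimal_tokens(n)  # ~8.1e12 tokens
print(f"compute-optimal tokens: {d_opt:.2e}")                            # ~8.10e+12
print(f"training FLOPs at that point: {training_flops(n, d_opt):.2e}")   # ~1.97e+25

# A smaller model trained "well past" optimal, e.g. an 8B model on 15T tokens:
print(f"8B over-trained ratio: {15e12 / chinchilla_optimal_tokens(8e9):.1f}x")  # ~93.8x
```

The point of the comparison: the small models sit far beyond their heuristic optimum, while the 405B stops near its own, which is what "still had runway" means.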
Makes sense, right? Why else build a model so large that almost nobody can conceivably run it, if not to squeeze the most performance out of a fixed dataset and compute budget? It was always a distillation source, not a production model.