
Yeah, but what does that actually mean? That if they had simply doubled the parameters of Llama 405b, it would score way better on benchmarks and become the new state of the art by a wide margin?

I mean, going by their own model evals on various benchmarks (https://llama.meta.com/), Llama 405b scores anywhere from a few points to almost 10 points more than Llama 70b, even though the former has ~5.8x as many params. As far as scale is concerned, the relationship isn't even linear.

Which in most cases makes sense: you obviously can't score above 100% on these benchmarks, so if the smaller model is already at ~95% or whatever, there isn't much room for improvement. GPQA, however, is nowhere near saturated: Llama 70b scores ~47% and Llama 405b only ~51%. That's not a huge improvement despite the significant difference in size.
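
To put a rough number on the doubling question: here's a back-of-the-envelope sketch (my own assumption of log-linear scaling, not anything Meta has published) that fits a line through those two GPQA points and extrapolates to a hypothetical 810b model.

    # Back-of-the-envelope only: assumes GPQA score grows ~linearly in
    # log10(params), fitted through the two points quoted above.
    import math

    p_small, s_small = 70e9, 47.0    # Llama 70b, ~47% GPQA
    p_large, s_large = 405e9, 51.0   # Llama 405b, ~51% GPQA

    # score = a + b * log10(params), solved from the two points
    b = (s_large - s_small) / (math.log10(p_large) - math.log10(p_small))
    a = s_small - b * math.log10(p_small)

    # hypothetical 810b model (double the params of 405b)
    print(round(a + b * math.log10(2 * p_large), 1))  # ~52.6

Under that (admittedly crude) fit, doubling 405b buys roughly another point and a half on GPQA, which is exactly the kind of diminishing return I mean.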

Most likely, we're going to see improvements in small model performance by way of better data. Otherwise, though, I fail to see how we're supposed to get significantly better model performance by way of scale when the relationship between model size and benchmark scores is nowhere near linear. I really wish someone who's on team "scale is all you need" could help me see what I'm missing.

And of course we might find some breakthrough that enables actual reasoning in models or whatever, but I find that purely speculative at this point, anything but inevitable.
