Not OP, but likely because that was the only metric/benchmark/however you want t...

Not OP, but likely because that was the only metric/benchmark/however you want to call it OpenAI showcased in the stream and on the blog to highlight the improvement between 4o and 4.5. To say that this is not really a good metric for comparison, not least because prompting can have a massive impact in this regard, would be an understatement.