Your argument seems convincing, but I'd like to offer a competing narrative: any benchmark that is public quickly becomes useless, because companies optimize for it - especially in AI, which runs on piles of investor money and needs some proof of progress to show for it.
That's why I keep some private benchmarks of my own (I'll sketch what such a harness looks like below), and I'm sorry to say that the transition from GPT-4 to o1 wasn't unambiguously a step forward (better on some tasks, worse on others).
On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to work with what we have - though many of us just treat the public numbers as noise and don't give them much weight. Ultimately, models have to prove themselves by performing the tasks individual users actually want done.
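For concreteness, here's a minimal sketch of what such a private harness can look like. Everything here is an illustrative assumption, not my actual setup: the file name, the exact-match grading, and the prompt format are stand-ins for whatever you keep private.

```python
# Minimal sketch of a private benchmark harness (file name, task schema,
# and exact-match grading are illustrative assumptions).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def score(model: str, tasks: list[dict]) -> float:
    # Each task is {"prompt": ..., "answer": ...}; exact match is crude,
    # but it keeps the grading private and reproducible.
    hits = sum(ask(model, t["prompt"]) == t["answer"] for t in tasks)
    return hits / len(tasks)

with open("private_tasks.jsonl") as f:  # never published anywhere
    tasks = [json.loads(line) for line in f]

for model in ("gpt-4", "o1"):
    print(model, score(model, tasks))
```

Exact-match grading is deliberately crude; the point of a private set isn't sophistication, it's that nobody can train on it.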
Rather, any logic puzzle you post on the internet as something AIs are bad at ends up in the next round of training data, so AIs get better at that specific question - not because AI companies are optimizing for a benchmark, but because they suck up everything.
ARC has two test sets that are not posted on the internet. One is kept completely private and never shared; it is used for testing open-source models, which are run locally with no internet access. The other is used for testing closed-source models that are only available as APIs. That one could leak in theory, but it is still not posted on the internet and can't show up in any web crawl.
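For what it's worth, the fully private track doesn't require anything exotic: the scoring loop can run on an air-gapped machine. Here's a minimal sketch, assuming the private set uses the same JSON task format as the public ARC data (train/test pairs of integer grids); the directory name is hypothetical, and the identity-baseline predict() is a placeholder for whatever locally hosted model is under test.

```python
# Sketch of an offline ARC-style scoring loop. Assumes the public ARC
# JSON schema: {"train": [{"input": grid, "output": grid}, ...],
#               "test":  [{"input": grid, "output": grid}, ...]}.
import glob
import json

def predict(train_pairs, test_input):
    # Trivial identity baseline, standing in for a locally hosted model.
    # With the machine's network disabled, nothing about the tasks can leak.
    return test_input

correct = total = 0
for path in glob.glob("private_eval/*.json"):  # hypothetical local directory
    with open(path) as f:
        task = json.load(f)
    for pair in task["test"]:
        total += 1
        if predict(task["train"], pair["input"]) == pair["output"]:
            correct += 1

print(f"{correct}/{total} test outputs predicted exactly")
```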
You could argue that the models get an advantage from seeing the training set, which is on the internet. But every task is unique, and generalizing from the training set to the test set is the whole point of the benchmark, so that's not a serious objection.
That's exactly why they have two test sets. And OpenAI has contractually committed to not training on data passed to the API. I don't believe they would burn their reputation and risk legal action just to cheat on ARC. The numbers they've reported are not implausible IMO.
Yeah, I'm sure the Microsoft-backed company headed by Mr. Worldcoin Altman, whose sole mission statement so far has been to overhype every single product they've released, wouldn't dare cheat on one of these benchmarks that "prove" AGI (as they've been claiming since GPT-2).