The problem with ARC is that there are a finite number of heuristics that could be enumerated and trained for, which would give a model a substantial leg up on this evaluation but would not generalize to other domains.

For example, if they produce millions of examples of the types of problems o3 still struggles with, the model would probably do better on similar questions.

Perhaps the private dataset is different enough that this isn’t a problem, but the ideal situation would be unveiling a truly novel dataset, which seems to be what ARC aims to do.