Two heads is better than 1. 10 is way better. Even if they aren't a field of exp...

inerte · 2024-12-20T21:50:03 1734731403

Aaaah thanks for the explanation. PANEL of 10 humans, as in, they were all together. I parsed the phrase as "10 random people" > "average human" which made little sense.

modeless · 2024-12-20T22:04:05 1734732245

Actually I believe that he did mean 10 random people tested individually, not a committee of 10 people. The key being that the question is considered to be answered correctly if any one of the 10 people got it right. This is similar to how LLMs are evaluated with pass@5 or pass@10 criteria (because the LLM has no memory so running it 10 times is more like asking 10 random people than asking the same person 10 times in a row).

I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.

elcomet · 2024-12-20T23:26:54 1734737214

He means 10 humans voting for the answer

generic92034 · 2024-12-21T00:45:05 1734741905

If that works that way at all depends on the group dynamic. It is easily possible that a not so bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meetings with various hierarchy levels in a company.

daveguy · 2024-12-21T16:23:25 1734798205

The ARC AGI questions can be a little tricky, but the solutions can generally be easily explained. And you get 3 tries. So, the 3 best descriptions of the solution votes on by 10 people is going to be very effective. The problem space just isn't complicated enough for an unofficial "leader" to sway the group to 3 wrong answers.

herval · 2024-12-21T00:12:27 1734739947

Depends on the task, no?

Do you have a sense of what kind of task this benchmark includes? Are they more “general” such that random people would fare well or more specialized (ie something a STEM grad studied and isn’t common knowledge)?

judge2020 · 2024-12-21T00:23:18 1734740598

It does, which is why I don’t really subscribe to any test like this being great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become a SME in entirely new areas.

zamadatix · 2024-12-21T15:00:09 1734793209

Aha, "at least 1 of a panel of 10", not "the panel of 10 averaged"! Thanks, that makes so much more sense to me now.

I have failed the real ARC AGI :)