They actually have [0]. They were revealed to have had access to (the majority of) the FrontierMath problem set while everyone thought the problem set was confidential, and they published benchmark results for their o3 models while people presumed they had no such access. One is free to trust their "verbal agreement" that they did not train their models on it, but access they did have, and it was not revealed until much later.
Curious you left out FrontierMath’s statement that they provided 300 questions plus answers, and a holdout set of 50 questions without answers, to allay this concern. [0]
We can assume they’re lying too, but at some point “everyone’s bad because they’re lying, which we know because they’re bad” gets a little tired.
1. I said the majority of the problems, and the article I linked also mentioned this. Nothing “curious” really, but if you thought this additional source adds something more, thanks for adding it here.
2. We know that “open”ai is bad, for many reasons, but this is irrelevant. I want processes themselves to not depend on the goodwill of a corporation to give intended results. I do not trust benchmarks that were first presented as secret and later revealed not to be, regardless of whether the benchmarked product comes from a company I otherwise trust or not.
Fair enough. It’s hard for me to imagine being so offended by the way they screwed up disclosure that I’d reject empirical data, but I get that it’s a touchy subject.
When the data is secret and unavailable to the company before the test, it doesn’t rely on me trusting the company. When the data is not secret and is available to the company, I have to trust that the company did not use that prior knowledge to their advantage. When the company lies and says it did not have access, then later admits that it did have access, that means the data is less trustworthy from my outsider perspective. I don’t think “offense” is a factor at all.
If a scientific paper comes out with “empirical data”, I will still look at the conflicts of interest section. If no conflicts of interest are listed, but it is later found out that there are several, and the authors promise that, while they did not disclose them, the conflicts also did not affect the paper, I would be more skeptical. I am not “offended”. I am not “rejecting” the data, but I am taking those factors into account when determining how confident I can be in the validity of the data.
> When the company lies and says it did not have access, then later admits that it did have access, that means the data is less trustworthy from my outsider perspective.
This isn't what happened? I must be missing something.
AFAIK:
The FrontierMath people self-reported that they had a shared folder, which the OpenAI people had access to, containing a subset of the questions.
No one denied anything, no one lied about anything, no one said they didn't have access. There was no data obtained under the table.
The motte is "they had data for this one benchmark"
You're right; upon reflection, it seems there might be some misunderstanding here:
Motte and Bailey refers to an argumentative tactic where someone switches between an easily defensible ("motte") position and a less defensible but more ambitious ("bailey") position. My example should have been:
- Motte (defensible): "They had access to benchmark data (which isn't disputed)."
- Bailey (less defensible): "They actually trained their model using the benchmark data."
The statements you've provided:
"They got caught getting benchmark data under the table" (suggesting improper access)
"One is free to trust their 'verbal agreement' that they did not train their models on that, but access they did have."
These two statements are similar but not logically identical. One explicitly suggests improper or secretive access ("under the table"), while the other acknowledges access openly.
So, rather than being logically identical, the difference is subtle but meaningful. One emphasizes improper access (a stronger claim), while the other points only to possession or access, a more easily defensible claim.
The FrontierMath benchmark people said OpenAI had shared-folder access to some subset of eval Qs, which has since been replaced. Take a few leaps and, yes, that becomes getting "data under the table" - but those are leaps! - and the former, let's be clear, is the motte here.
This is nonsense. Obviously the problem with getting "data under the table" is that they may have used it to train their models, thus rendering the benchmarks invalid. Apart from that danger, there is no other risk in their having had access beforehand. We do not know whether they used it for training, but the only reassurance being some "verbal agreement", as is reported, is not very reassuring. People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
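For concreteness, here is a minimal Bayes-style sketch of what adjusting P(model_capabilities|frontiermath_results) could look like; all numbers are made up purely for illustration and are not estimates of anything about OpenAI or FrontierMath.

```python
# Minimal sketch (illustrative numbers only): how prior access to benchmark
# data can weaken the evidence a benchmark result provides.
# P(result | capable) stays high either way; P(result | not capable) rises
# if the problems may have leaked into training, so the update is smaller.

def posterior(prior_capable, p_result_given_capable, p_result_given_not):
    """Bayes' rule: P(capable | result)."""
    num = prior_capable * p_result_given_capable
    den = num + (1 - prior_capable) * p_result_given_not
    return num / den

prior = 0.5  # hypothetical prior P(model is as capable as claimed)

# Truly held-out benchmark: a weak model is unlikely to score this well.
held_out = posterior(prior, p_result_given_capable=0.9, p_result_given_not=0.1)

# Lab had access to the problems: a weak model could still score well
# via contamination, so the same result is weaker evidence.
had_access = posterior(prior, p_result_given_capable=0.9, p_result_given_not=0.4)

print(f"P(capable | result), held-out:   {held_out:.2f}")   # ~0.90
print(f"P(capable | result), had access: {had_access:.2f}")  # ~0.69
```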
> obviously the problem with getting "data under the table" is that they may have used it to train their models
I've been avoiding mentioning the maximalist version of the argument (they got data under the table AND used it to train models), because training wasn't stated until now, and it would have been unfair to bring it up before it was mentioned. That is, it's two baileys out from "they had access to a shared directory that had some test qs in it, and this was reported publicly, and fixed publicly".
There's been a fairly severe communication breakdown here. I don't want to distract from, e.g., what the nonsense is, so I won't belabor that point, but I don't want you to think I don't want to engage on it - I just won't in this particular post.
> but the only reassurance being some "verbal agreement", as is reported, is not very reassuring
It's about as reassuring as it gets without them releasing the entire training data, which is, at best, with charity, marginally - oh so marginally - reassuring, I assume? If the premise is that we can't trust anything self-reported, they could lie there too?
> People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
Certainly, that's not in dispute (perhaps the idea that you are forbidden from adjusting your opinion is the nonsense you're referring to? I certainly can't control that :) Nor would I want to!)
What is nonsense is the suggestion that there is a "reasonable" argument that they had access to the data (which we now know), and an "ambitious" argument that they used the data. Nobody claimed to know for certain that the data was used; that is a strawman. We are saying that there is now a non-zero probability that it was. This is obviously what we have been discussing since the beginning, else we would not care whether they had access or not and it would not have been mentioned. There is a simple, single argument made here in this thread.
And FFS, I assume the dispute is about the P given by people, not about whether people are allowed to have a P.