
I am beginning to think these human eval tests are a waste of time at best, and negative value at worst. Maybe I am being snobby, but I don't think the average human is able to properly evaluate usefulness, truthfulness, or other metrics that I actually care about. I am sure this is good for OpenAI, since if more people like what they hear, they are more likely to come back.

I don't want my AI more obsequious, I want it more correct and capable.

My only use case is coding though, so maybe I am not representative of their usual customers?




> I want it more correct and capable.

How is it supposed to be more correct and capable if these human eval tests are a waste of time?

Once you ask it to do more than add two numbers together, it gets a lot more difficult and subjective to determine whether it's correct and how correct.


I agree it's a hard problem. However, I think there are a number of tests out there that are able to objectively test capability and truthfulness.

I've read reports that some of the changes that are preferred by human evaluators actually hurt the performance on the more objective tests.


Please tell me how we objectively determine how correct something is when you ask an LLM: "Was Russia the aggressor in the current Ukraine / Russia conflict?"

One LLM says: "Yes."

The other says: "Well, it's hard to say because what even is war? And there's been conflict forever, and you have to understand that many people in Russia think there is no such thing as Ukraine and it's always actually just been Russia. How can there be an aggressor if it's not even a war, just a special operation in a civil conflict? And, anyway, Russia is such a good country. Why would it be the aggressor? To its own people even!? Vladimir Putin is the president of Russia, and he's known to be a kind and just genius who rarely (if ever) makes mistakes. Some people even think he's the second coming of Christ. President Zelenskyy, on the other hand, is considered by many in Russia and even the current White House to be a dictator. He's even been accused by Elon Musk of unspeakable sex crimes. So this is a hard question to answer and there is no consensus among everyone who was the aggressor or what started the conflict. But more people say Russia started it."


Because Russia did undeniably open hostilities? They even admitted to this both times, the second admission being in the form of announcing a “special military operation” when the ceasefire was still active. We also have photographic evidence of them building forces on a border during a ceasefire and then invading. This is like responding to “did Alexander the Great invade Egypt” by going on a diatribe about how much war there was in the ancient world, and that the Ptolemaic dynasty believed themselves the rightful rulers, therefore who’s to say if they did invade or just took their rightful place. There is an objective record here; whether or not people want to try to hide it behind circuitous arguments is a different matter. If we’re going down this road, I can easily redefine any known historical event with hand-wavy nonsense that doesn’t actually have anything to do with the historical record of events, just “vibes.”


Okay - but EXACTLY how wrong (or not correct) is the second answer?

Please tell me precisely on a 0-1 floating scale, where 0 is "yes" and 1 is "no".


One might say, if this were a test being graded by a human in a history class, that the answer is 100% incorrect, given the actual record of events and the statement's failure to mention that record. You can argue the causes, but that’s not the question.


We'll agree to disagree. /s


These eval tests are just an anchor point to measure distance from, but it's true, picking the anchor point is important. We don't want to measure in the wrong direction.


The SuperTuring era.



