Hacker News

I'm noticing a common theme in all these riddles the models are asked and get wrong.

They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's a tautology; you would usually say "a mother and her son...".

I think it may answer correctly if you start off asking "Please solve the below riddle:"

There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)
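A minimal sketch of the reframing trick described above. The prefix string is the one suggested in the comment; the function name and the truncated example question are illustrative, not from any real API:

```python
def as_riddle(question: str) -> str:
    """Wrap a gotcha question so the model is told up front it's a riddle.

    The idea from the comment: signalling "this is a riddle" stops the
    model from over-reading the odd wording as a mistake to correct.
    """
    return "Please solve the below riddle:\n\n" + question

# Illustrative usage with a placeholder question.
prompt = as_riddle("A woman and her son are in a car accident...")
```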




> They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".

How is "a woman and her son" badly worded? The meaning is clear and blatantly obvious to any English speaker.


Go read the whole riddle: add the rest of it and you'll see it's contrived, which is why it's a riddle even for humans. The model, in its thinking (which you can read), places undue weight on certain anomalous factors. In practice, a person would phrase this far more eloquently than the riddle does.


Yup. The models fail on gotcha questions asked without warning, especially when evaluated on the first snap answer. Much like approximately all humans.


> especially when evaluated on the first snap answer

The whole point of o1 is that it wasn't "the first snap answer", it wrote half a page internally before giving the same wrong answer.


Is that really its internal 'chain of thought' or is it a post-hoc justification generated afterward? Do LLMs have a chain of thought like this at all or are they just convincing at mimicking what a human might say if asked for a justification for an opinion?


It's slightly stranger than this, as both are true. It's already baked into the model, but chain of thought does improve reasoning; you only have to look at maths problems. A short guess would be wrong, but it would get the answer correct if asked to break the problem down and reason (harder to see nowadays, as it has access to calculators).
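The contrast described here can be sketched as two prompt framings for the same question. Everything below is illustrative (the function names and instruction text are assumptions, not any model's real API); the point is only the difference between asking for a snap answer and asking for step-by-step reasoning:

```python
def snap_prompt(question: str) -> str:
    # "Short guess" mode: ask for just the final answer,
    # the framing the comment says tends to be wrong.
    return question + "\nAnswer with only the final number."

def cot_prompt(question: str) -> str:
    # Chain-of-thought mode: ask the model to break the
    # problem down before answering, which the comment
    # says improves accuracy on maths problems.
    return (question
            + "\nBreak the problem down and reason step by step "
            + "before giving the final answer.")

question = "What is 17 * 24?"
```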



