I thought that with this chain-of-thought approach the model might be better suited to solving logic puzzles, e.g. ZebraPuzzles [0]. It produced a ton of "reasoning" tokens but hallucinated more than half of the solution, using names/fields that weren't in the puzzle. Not a systematic evaluation, but it seems like a regression from 4o-mini. Perhaps it does better with code reasoning problems, though -- these logic puzzles are essentially contrived to require deductive reasoning.

[0] https://zebrapuzzles.com
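
To be concrete about the failure mode: the puzzle defines a fixed set of names and attribute values, and the model's answer used values outside those sets. A minimal sketch of the kind of check I mean, in Python (the puzzle/answer structures here are made up for illustration, not the site's actual format):

    # Sketch only: flag values in a model's answer that the puzzle
    # never defined. The dicts below are hypothetical.
    puzzle_domains = {
        "name":  {"Alice", "Bob", "Carol", "Dave", "Eve"},
        "drink": {"tea", "coffee", "milk", "juice", "water"},
        "pet":   {"dog", "cat", "fish", "bird", "horse"},
    }

    model_answer = {  # one house, as parsed from the model's output
        "name": "Frank",    # not in the puzzle -> hallucinated
        "drink": "coffee",
        "pet": "zebra",     # not in the puzzle -> hallucinated
    }

    hallucinated = {
        field: value
        for field, value in model_answer.items()
        if value not in puzzle_domains.get(field, set())
    }
    print(hallucinated)  # {'name': 'Frank', 'pet': 'zebra'}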




Hey, I run ZebraPuzzles.com, thanks for mentioning it! Right now I'm trying to improve the puzzles so that people can't "cheat" using LLMs so easily ;-).


It's fantastic! Thanks for the great work.


Thank you so much!


o1-mini does better than any other model on zebra puzzles. Maybe you got unlucky on one question?

https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...


Entirely possible. I did not try to test systematically or quantitatively, but these puzzles have been a recurring, easy "demo" case I've used with each release since 3.5-turbo.

The super-verbose chain-of-thought that o1 produces seems well suited to logic puzzles, so I expected it to do reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.
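
To illustrate the framing point (these templates are invented for the example, not what I actually ran): the same puzzle can be posed bare or with explicit constraints and output formatting, and the measured accuracy can differ substantially between the two.

    # Illustrative only: two framings of the same puzzle text. A harness
    # that picks one or the other can report very different accuracy.
    PUZZLE = "There are 5 houses ..."  # full puzzle text elided

    bare = PUZZLE + "\n\nSolve the puzzle."

    structured = (
        "Solve this logic puzzle. Use ONLY the names and attributes "
        "given in the puzzle text. Answer as a table with columns: "
        "house, name, drink, pet.\n\n" + PUZZLE
    )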



