Hacker News

I did a few tests and asked it some legal questions. 4o gave me the correct answer immediately.

o1-preview gave a much more in-depth but completely wrong answer. It took five follow-ups to get it to recognize that it had hallucinated a non-existent law.
That is very interesting. Would you mind testing the same prompt with Claude 3.5 Sonnet and Claude 3 Opus? If they are not available to you, would you be willing to share the prompt/question? Thank you.


This is interesting, since they claim it does well on STEM questions, which I'd assume would require a similar level of reasoning complexity for a human.


This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.

There is one other wonderful thing about symbolic math: the glorious '=' sign. Derivations are structured top-to-bottom, left-to-right, which is amenable to the next-token prediction behavior and multi-head attention of transformer-based LLMs.

My guess is that forming the problem statement into an equation is as difficult a problem for these models as actually running through the equations. However, having taken the Physics GRE, I know they aim for parity of difficulty between years (even though they normalize scores), so the problems are fairly standard, with permutations of the same problem types from year to year.

This is not to diminish how cool this is; standardized tests just have an element of predictability to them. I still find this result genuinely neat: it's a qualitative improvement over non-CoT LLMs, even if tools like Mathematica can do the steps more reliably once the problem has been formed. Used judiciously, this is a valuable feature.
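
To illustrate the division of labor: once a word problem has been turned into an equation, the remaining steps are mechanical. A minimal stdlib-only sketch, using the quadratic formula in place of a full CAS like Mathematica (the kinematics example and function name are mine, purely for illustration):

```python
import math

def positive_time(s: float, v0: float, a: float) -> float:
    """Positive root of s = v0*t + (a/2)*t**2, i.e. of
    (a/2)*t**2 + v0*t - s = 0, via the quadratic formula."""
    disc = v0**2 + 2 * a * s
    return (-v0 + math.sqrt(disc)) / a

# Drop from rest (v0 = 0) through s = 20 m at a = 10 m/s^2:
print(positive_time(s=20.0, v0=0.0, a=10.0))  # t = 2.0
```

The hard, unreliable part is everything before this function is called: reading the prose, choosing the model, and writing down the equation. That is the step where an LLM adds value, and also where it can quietly go wrong.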


A difficult-to-guess fraction of all of these results reflects training to the test, in various forms.


Perhaps the smaller model used in o1 is overtrained on arXiv and code relative to 4o (or undertrained on legal text).



