I did a few tests and asked it some legal questions. 4o gave me the correct answ...

AhtiK · 2024-09-13T08:04:27 1726214667

That is very interesting. Would you mind testing the same prompt with Claude Sonnet 3.5 and Opus? If not available to you, would you be willing to share the prompt/question? Thank you.

elicksaur · 2024-09-13T02:58:06 1726196286

This is interesting since they claim it does well on STEM questions, which I’d assume would be a similar level of reasoning complexity for a human.

abernard1 · 2024-09-13T07:15:04 1726211704

This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.

There is one other wonderful thing about symbolic math: the glorious '=' sign. It's structured everywhere from top to bottom and left to right, which is amenable to the next-token prediction behavior and multi-head attention of transformer-based LLMs.

My guess is that forming the problem statement into an equation is as difficult a problem for these models as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), I can say the problems are fairly standard, with permutations of the same problem types from year to year.

This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably once the problem is formulated. I think that, judiciously used, this is a valuable feature.

waveBidder · 2024-09-13T04:35:46 1726202146

A hard-to-estimate fraction of all of these results comes from training to the test in various forms.

m101 · 2024-09-13T21:45:45 1726263945

Perhaps the smaller model used in o1 is overtrained on arXiv and code relative to 4o (or undertrained on legal text).