That is very interesting. Would you mind testing the same prompt with Claude Sonnet 3.5 and Opus? If not available to you, would you be willing to share the prompt/question? Thank you.
This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.
There is one other wonderful thing about symbolic math: the glorious '=' sign. Everything is structured top-to-bottom, left-to-right, which is amenable to the next-token prediction behavior and multi-head attention of transformer-based LLMs.
My guess is that turning a problem statement into an equation is as difficult a problem for these models as actually working through the equations. However, having taken the Physics GRE, and knowing that they try for parity of difficulty between years (even though they normalize the scores), I can say the problems are fairly standard, with permutations of the same problem types from year to year.
This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I do find this result neat, though; it's a real qualitative improvement over non-CoT LLMs, even if tools like Mathematica can do the steps more reliably once the problem has been formulated. I think that, used judiciously, this is a valuable feature.
o1-preview gave a much more in-depth but completely wrong answer. It took 5 follow-ups to get it to recognize that it had hallucinated a non-existent law.