That's equivalent to what we are asking the model to do. If you give the model a calculator, it will get 100%. If you give it pen and paper (i.e. let it show its working), it will get near 100%.
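Concretely, "giving the model a calculator" means tool calling: the model hands the arithmetic off to code that computes it exactly, instead of guessing digits token by token. A minimal sketch, assuming an OpenAI-style tools API (the model name, prompt, and the happy path where the model actually invokes the tool are all illustrative):

```python
# Sketch: expose an exact arithmetic tool the model can call.
import json
from openai import OpenAI

client = OpenAI()

def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression exactly (trusted input only)."""
    return str(eval(expression, {"__builtins__": {}}, {}))

tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate an arithmetic expression and return the exact result.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 48571 * 9283?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

# Assume the model chose to call the tool; run it and hand back the result.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": calculator(**args)})

final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```

"Pen and paper" is the same idea without the tool: prompt the model to write out intermediate steps rather than emit the final digits in one shot, so each step only requires easy local arithmetic.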
> That's equivalent to what we are asking the model to do.
Why?
What does it mean to give a model a calculator?
What do you mean “let it show its working”? If I ask an LLM to do a calculation, I never said it couldn’t express the answer in long-form text or with intermediate steps.
If I ask a human to do a calculation that they can’t reliably do in their head, they are intelligent enough to know that they should use a pen and paper without needing my preemptive permission.