> But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878
Your link says that R1, not all models like R1, fails at generalization.
Of particular note:
> We expose DeepSeek R1 to the variations of AIW Friends problem and compare model behavior to o1-preview, o1-mini and Claude 3.5 Sonnet. o1-preview handles the problem robustly, DeepSeek R1 shows strong fluctuations across variations with distribution very similar to o1-mini.
I'd expect that OpenAI's stronger reasoning models also don't generalize too far outside of the areas they are trained for. At the end of the day these are still just LLMs, trying to predict continuations, and how well they do is going to depend on how well the problem at hand matches their training data.
Perhaps the type of RL used to train them also has an effect on generalization, but choice of training data has to play a large part.
Nobody generalizes very far outside the areas they're trained for. How 'far' that is may be shorter with today's state of the art, but the mere presence of failure modes doesn't mean much on its own.
The way the authors talk about LLMs really rubs me the wrong way. They spend more of the paper talking up the 'claims' about LLMs that they are going to debunk than actually doing any interesting study.
They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.
The fact that a tool can break, or that the company manufacturing it lies about its abilities, is annoying but does not imply that the tool is useless.
I experience LLM "reasoning" failure several times a day, yet I find them useful.
>They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.
And lo and behold, they still found a glaring failure. You can't fault them for not buying into the hype.
But it is still dishonest to declare reasoning LLMs a scam simply because you searched for a failure mode.
If given a few hundred tries, I bet I could find an example where you reason poorly too. Wikipedia has a whole list of common failure modes of human reasoning: https://en.wikipedia.org/wiki/List_of_fallacies
Well, given that the success rate is no more than 90% even in the best cases, you could probably find a failure in about 10 tries. The only exception is o1-preview. And this is with just a simple substitution of parameters.
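For what it's worth, a quick back-of-the-envelope check of that "10 tries" figure, assuming independent attempts and a flat 90% per-attempt success rate (the independence and the flat rate are my assumptions, not something the linked study measures):

```python
# Rough sanity check of the "failure in about 10 tries" figure.
# Assumptions (mine, not from the linked study): attempts are independent
# and the per-attempt success rate is a flat 0.9 ("no more than 90%").
success_rate = 0.9
p_fail = 1 - success_rate

# Expected number of attempts until the first failure (geometric distribution).
expected_tries = 1 / p_fail                 # = 10

# Probability of seeing at least one failure within 10 attempts.
p_fail_within_10 = 1 - success_rate ** 10   # ~= 0.65

print(f"expected tries to first failure: {expected_tries:.0f}")
print(f"P(at least one failure in 10 tries): {p_fail_within_10:.2f}")
```

So on average the first failure shows up around attempt 10, and there's roughly a two-in-three chance of hitting one within 10 tries.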