> But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878
Your link says that R1, not all models like R1, fails at generalization.
Of particular note:
> We expose DeepSeek R1 to the variations of AIW Friends problem and compare model behavior to o1-preview, o1-mini and Claude 3.5 Sonnet. o1-preview handles the problem robustly, DeepSeek R1 shows strong fluctuations across variations with distribution very similar to o1-mini.
I'd expect that OpenAI's stronger reasoning models also don't generalize too far outside of the areas they are trained for. At the end of the day these are still just LLMs, trying to predict continuations, and how well they do is going to depend on how well the problem at hand matches their training data.
Perhaps the type of RL used to train them also has an effect on generalization, but choice of training data has to play a large part.
Nobody generalizes very far outside the areas they're trained for. How 'far' that is may be shorter with today's state of the art, but the mere presence of failure modes doesn't mean much on its own.
The way the authors talk about LLMs really rubs me the wrong way. They spend more of the paper talking up the 'claims' about LLMs that they are going to debunk than actually doing any interesting study.
They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.
The fact that a tool can break, or that the company manufacturing it lies about its abilities, is annoying but does not imply that the tool is useless.
I experience LLM "reasoning" failure several times a day, yet I find them useful.
>They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.
And lo and behold, they still found a glaring failure. You can't fault them for not buying into the hype.
But it is still dishonest to declare reasoning LLMs a scam simply because you searched for a failure mode.
If given a few hundred tries, I bet I could find an example where you reason poorly too. Wikipedia has a whole list of common failure modes of human reasoning: https://en.wikipedia.org/wiki/List_of_fallacies
Well, given that the success rate is no more than 90% even in the best cases, you could probably find a failure in about 10 tries. The only exception is o1-preview. And this is with just a simple substitution of parameters.
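For what it's worth, a quick back-of-the-envelope check of that "10 tries" figure, assuming independent attempts and a flat 90% per-attempt success rate (the independence and the flat rate are my assumptions, not something the linked study measures):

```python
# Rough sanity check of the "failure in about 10 tries" figure.
# Assumptions (mine, not from the linked study): attempts are independent
# and the per-attempt success rate is a flat 0.9 ("no more than 90%").
success_rate = 0.9
p_fail = 1 - success_rate

# Expected number of attempts until the first failure (geometric distribution).
expected_tries = 1 / p_fail                 # = 10

# Probability of seeing at least one failure within 10 attempts.
p_fail_within_10 = 1 - success_rate ** 10   # ~= 0.65

print(f"expected tries to first failure: {expected_tries:.0f}")
print(f"P(at least one failure in 10 tries): {p_fail_within_10:.2f}")
```

So on average the first failure shows up around attempt 10, and there's roughly a two-in-three chance of hitting one within 10 tries.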