
I like Raschka's writing, even if he is considerably more optimistic about this tech than I am. But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878

What they are certainly capable of doing is a wide variety of computations that simulate reasoning, and maybe that's good enough for your use case. But it is unpredictably brittle unless you spend a lot on o1-pro (and even then...). Raschka has a line about "whether and how an LLM actually 'thinks' is a separate discussion", but this isn't about semantics. R1 clearly sucks at deductive reasoning, and you will not understand "reasoning" LLMs if you take DeepSeek's claims at face value.

It seems especially incurious for him to copy-paste the "a-ha moment" from DeepSeek's technical report without critically investigating it. DeepSeek's claims are unscientific, without real evidence, and seem focused on hype and investment:

  This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. 

  The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
Perhaps it was able to solve that tricky Olympiad problem, but there are an infinite variety of 1st grade math problems it is not able to solve. I doubt it's even reliably able to solve simple variations of that root problem. Maybe it is! But it's frustrating how little skepticism there is about CoT, reasoning traces, etc.



> they are incapable of even the simplest "out-of-distribution" deductive reasoning

But the link demonstrates the opposite: these models absolutely are able to reason out of distribution, just not with perfect fidelity. The fact that they can do better than random is itself really impressive. And o1-preview does impressively well, only very rarely getting the wrong answer on variants of that Alice in Wonderland problem.

If you listened to most of the people critical of LLMs who call them a "stochastic parrot", it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.

Overall, poor reasoning that is better than random but frequently gives the wrong answer is fundamentally and categorically different from being incapable of reasoning.


anyone saying an LLM is a stochastic parrot doesn't understand them... they are just parroting what they heard.


A good literary production. I would have been proud of it had I thought of it, but there's a strong "whataboutery" element to observe here: if we use "stochastic parrot" as shorthand and you dislike the term, now you understand why we dislike the constant use of "infer", "reason" and "hallucinate".

Parrots are self-aware, complex reasoning brains which can solve problems in geometry, tell lies, and act socially or asocially. They also have complex vocal cords and can perform mimicry. Very few aspects of a parrot's behaviour are stochastic, but that also underplays how complex stochastic systems can be in their production. If we label LLM products as Stochastic Parrots it does not mean they like cuttlefish bones or are demonstrably modelled by Markov chains like Mark V Shaney.
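
(For concreteness, since Mark V Shaney came up: that bot was a word-level Markov chain over Usenet text. A toy sketch of that kind of "stochastic parrot", with a made-up corpus and function names of my own, looks roughly like this; every output word is a verbatim continuation seen in the training text, with no state beyond the last word.)

  import random
  from collections import defaultdict

  def build_chain(text):
      # Map each word to the list of words observed to follow it verbatim.
      chain = defaultdict(list)
      words = text.split()
      for prev, nxt in zip(words, words[1:]):
          chain[prev].append(nxt)
      return chain

  def babble(chain, start, length=12):
      # Pure lookup + sampling: no representation beyond the previous word.
      word, out = start, [start]
      for _ in range(length):
          if word not in chain:
              break
          word = random.choice(chain[word])
          out.append(word)
      return " ".join(out)

  corpus = "the parrot saw the mirror and the parrot spoke to the mirror"
  print(babble(build_chain(corpus), "the"))

Whatever LLMs are doing internally, it is not this: there is no learned representation here at all, just verbatim lookup, which is exactly why the shorthand undersells both parrots and stochastic systems.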


Well, parrots can make more parrots; LLMs can't make their own GPUs. So parrots win. But LLMs can interpolate and even extrapolate a little. Have you ever heard a parrot do translation, hearing you say something in English and repeating it in Spanish? So no, LLMs are not parrots. Besides their debatable abilities, they work with a human in the loop, which means humans push them outside their original distribution. That's not a parroting act; it's being able to do more than pattern matching and reproduction.


LLMs can easily order more GPUs over the internet, hire people to build a datacenter and reproduce.

Or, more simply... just hack into a bunch of AWS accounts, spin up machines, boom.


I don't like wading into this debate when semantics are very personal/subjective. But to me, it seems like almost a sleight of hand to add the stochastic part, when actually they're possibly weighted more on the parrot part. Parrots are much more concrete, whereas the term LLM could refer to the general architecture.

The question to me seems: If we expand on this architecture (in some direction, compute, size etc.), will we get something much more powerful? Whereas if you give nature more time to iterate on the parrot, you'd probably still end up with a parrot.

There's a giant impedance mismatch here (time scaling being one). Unless people want to think of parrots being a subset of all animals, and so 'stochastic animal' is what they mean. But then it's really the difference of 'stochastic human' and 'human'. And I don't think people really want to face that particular distinction.


"Expand the architecture" .. "get something much more powerful" .. "more dilithium crystals, captain"

Like I said elsewhere in this overall thread, we've been here before. Yes, you do see improvements with larger datasets and models weighted over more inputs. I suggest, or I guess I believe (to be more honest), that no amount of "bigger" here will magically produce AGI simply because of the scale effect.

There is no theory behind "more", which means there is no constructed sense of why; and the absence of abstract inductive reasoning continues to say to me that this stuff isn't making a qualitative leap into emergent anything.

It's just better at being an LLM. Even "show your working" is pointing to complex causal chains, not actual inductive reasoning as I see it.


And that's actually a really honest answer. Whereas someone of the opposite opinion might say that parroting, in the general copying-template sense, actually generalizes to all observable behaviours, because templating systems can be Turing-complete or something like that. It's templates all the way down, including complex induction, as long as there is a meta-template to match on its symptoms that it can be chained on.

Induction is a hard problem, but humans can skip past infinite compute time (I don't think we have any reason to believe humans have infinite compute) and still give valid answers, because there's some (meta-)structure to be exploited.

Whether machines / NNs can architecturally exploit this same structure is the truer question.


> this stuff isn't making a qualitative leap into emergent anything.

The magical missing ingredient here is search. AlphaZero used search to surpass humans, and the whole Alpha family from DeepMind is surprisingly strong, but narrowly targeted. The AlphaProof model uses LLMs and LEAN to solve hard math problems. The same problem-solving CoT data is being used by current reasoning models and they have much better results. The missing piece was search.
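
To make "search" concrete: the simplest form is sampling many candidates from a generator and keeping whichever one an external checker scores best. (AlphaProof's actual pipeline is far more elaborate; the policy and verifier below are toy stand-ins I made up, not any real model or LEAN.)

  import random

  def toy_policy(problem):
      # Stand-in for a model proposing one candidate answer: roughly right, noisy.
      return 2 * problem + random.gauss(0, 5)

  def toy_verifier(problem, candidate):
      # Stand-in for an external checker that can score proposals it could
      # never have generated itself (the role LEAN plays for AlphaProof).
      return -abs(candidate - 2 * problem)

  def best_of_n(problem, n=64):
      # "Search": sample many candidates, keep the one the checker likes best.
      candidates = [toy_policy(problem) for _ in range(n)]
      return max(candidates, key=lambda c: toy_verifier(problem, c))

  print(best_of_n(21.0))  # with more samples, the best candidate approaches 42

Tree search as in AlphaZero is a smarter version of the same idea: spend compute exploring candidates and keep what the evaluator likes.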


I'm sure both of you know this, but "stochastic parrot" refers to the title of a research article that contained a particular argument about LLM limitations that had very little to do with parrots.


The term is much more broadly known than the content of that (rather silly) paper... I'm not even certain that it's the first use of the term.



And the word "hallucination" ... has very little to do with...


But it's far easier for human parrots to parrot the soundbite "stochastic parrot" as a thought-terminating cliche.


There is definitely a mini cult of people that want to be very right about how everyone else is very wrong about AI.


Firstly, this is meta ad hominem: you're ignoring the argument to target the speaker(s).

Secondly, you're ignoring the fact that the community of voices with experience in data science, computer science and artificial intelligence is itself split on the qualities, or lack of them, in current AI. GPT and LLM are very interesting but say little or nothing to me of new theory of mind, or display inductive logic and reasoning, or even meet the bar for a philosopher's cave solution to problems. We've been here before so many, many times. "Just a bit more power, captain" was very strong in connectionist theories of mind, fMRI brain activity analytics, you name it.

So yes. There are a lot of "us" who are pushing back on the hype, and no we're not a mini cult.


> GPT and LLM are very interesting but say little or nothing to me of new theory of mind, or display inductive logic and reasoning, or even meet the bar for a philosopher's cave solution to problems.

The simple fact that they can generate language so well makes me think... maybe language itself carries more weight than we originally thought. LLMs got to this point without personal experience or embodiment; it should not have been possible, but here we are.

I think philosophers are lagging science now. The RL paradigm of agent-environment-reward based learning seems to me a better one than what we have in philosophy now. And if you look at how LLMs model language as high-dimensional embedding spaces... this could solve many intractable philosophical problems, like the infinite homunculus regress problem. Relational representations straddle the midpoint between 1st and 3rd person, offering a possible path over the hard problem "gap".


There are a couple Twitter personalities that definitely fit this description.

There is also a much bigger group of people that haven't really tried anything beyond GPT-3.5, which was the best you could get without paying a monthly subscription for a long time. One of the biggest reasons for r1 hype, besides the geopolitical angle, was people could actually try a reasoning model for free for the first time.


i.e., the people who say AI is dumb? Or are you saying I'm in a cult for being pro it? I'm definitely part of that cult, the "we already have AGI and you have to contort yourself into a pretzel to believe otherwise" cult. Not sure if there is a leader though.


I didn't realize my post can be interpreted either way. I'll leave it ambiguous, hah. Place your bets I guess.


You think we have AGI? What makes you think that?


By knowing what each of the letters stands for


Well that’s disappointing. It was an extraordinary claim that really interested me.

Thought I was about to learn!

Instead, I just met an asshole.


When someone says "i'm in the cult that believes X", don't expect a water tight argument for the existence of X.


> If you listened to most of the people critical of LLMs who call them a "stochastic parrot", it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.

You don't seem to understand how they work. They recurse their solution, meaning if they have remembered components they parrot back sub-solutions. It's a bit like a natural-language computer; that way you can get them to do math etc., although the instruction set isn't that of a Turing-complete language.

They can't recurse over sub-sub-parts they haven't seen, but problems that have similar sub-parts can of course be solved; anyone understands that.


> You don't seem to understand how they work

I don't think anyone understands how they work; these types of explanations aren't very complete or accurate. Such explanations/models should allow one to reason out what types of things the models are capable of vs. incapable of in principle, regardless of scale or algorithm tweaks, and those predictions and arguments never match reality and require constant goalpost shifting as the models are scaled up.

We understand how we brought them about via setting up an optimization problem in a specific way, that isn't the same at all as knowing how they work.

I tend to think, in the totally abstract philosophical sense and independent of the type of model, that at the limit of an increasingly capable function approximator trained on an increasingly large and diverse set of real-world cause/effect time-series data, you eventually develop an increasingly accurate and general predictive model of reality organically within the model. Some model types do have fundamental limits on their ability to scale like this, but we haven't yet found one for these models.

It is more appropriate to objectively test what they can and cannot do, and avoid trying to infer what we expect from how we think they work.


Well we do know pretty much exactly what they do, don't we?

What surprises us is the behaviors coming out of that process.

But surprise isn't magic, magic shouldn't even be on the list of explanations to consider.


Magic wasn’t mentioned here. We don’t understand the emerging behavior, in the sense that we can’t reason well about it and make good predictions about it (which would allow us to better control and develop it).

This is similar to how understanding chemistry doesn’t imply understanding biology, or understanding how a brain works.


Exactly, we don't understand, but we want to believe it's reasoning, which would be magic.


There's no belief or magic required, the word 'reasoning' is used here to refer to an observed capability, not a particular underlying process.

We also don't understand exactly how humans reason, so any claims that humans are capable of reasoning is also mostly an observation about abilities/capabilities.


> I don't think anyone understands how they work

Yes we do, we literally built them.

> We understand how we brought them about via setting up an optimization problem in a specific way, that isn't the same at all as knowing how they work.

You're mistaking "knowing how they work" with "understanding all of the emergent behaviors of them"

If I build a physics simulation, then I know how it works. But that's a separate question from whether I can mentally model and explain the precise way that a ball will bounce given a set of initial conditions within the physics simulation, which is what you seem to be talking about.
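
A toy version of what I mean, with made-up numbers: the update rules below are the whole "implementation", yet stating exactly where the ball ends up after thousands of steps still requires running it.

  def simulate(x, y, vx, vy, steps, dt=0.01, g=-9.8, restitution=0.9):
      # Every rule is known and fits in a few lines...
      for _ in range(steps):
          vy += g * dt
          x += vx * dt
          y += vy * dt
          if y < 0:  # bounce off the floor, losing some energy
              y, vy = 0.0, -vy * restitution
      return x, y

  # ...but predicting this exact output "by inspection" is another matter.
  print(simulate(x=0.0, y=2.0, vx=1.3, vy=0.0, steps=5000))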


> You're mistaking "knowing how they work" with "understanding all of the emergent behaviors of them"

By knowing how they work I specifically mean understanding the emergent capabilities and behaviors, but I don't see how it is a mistake. If you understood physics but knew nothing about cars, you couldn't claim to understand how a car works: "simple, it's just atoms interacting according to the laws of physics." That would not let you, e.g., explain its engineering principles or capabilities and limitations in any meaningful way.


We didn't really build them, we do billion-dollar random searches for them in parameter space.


> But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning:

That's not actually what your link says. The tweet says that it solves the simple problem (the one they originally designed to foil base LLMs), so they had to invent harder problems until they found one it could not reliably solve.


Did you see how similar the more complicated problem is? It's nearly the exact same problem.


> But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878

Your link says that R1, not all models like R1, fails at generalization.

Of particular note:

> We expose DeepSeek R1 to the variations of AIW Friends problem and compare model behavior to o1-preview, o1-mini and Claude 3.5 Sonnet. o1-preview handles the problem robustly, DeepSeek R1 shows strong fluctuations across variations with distribution very similar to o1-mini.


I'd expect that OpenAI's stronger reasoning models also don't generalize too far outside of the areas they are trained for. At the end of the day these are still just LLMs, trying to predict continuations, and how well they do is going to depend on how well the problem at hand matches their training data.

Perhaps the type of RL used to train them also has an effect on generalization, but choice of training data has to play a large part.


Nobody generalizes too far outside the areas they're trained for. That length, "far", is probably shorter with today's state of the art, but the presence of failure modes doesn't mean anything.


The way the authors talk about LLMs really rubs me the wrong way. They spend more of the paper talking up the 'claims' about LLMs that they are going to debunk than actually doing any interesting study.

They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.


What the hype crowd doesn't get is that for most people, "a tool that randomly breaks" is not useful.


That a tool can break, or that the company manufacturing it lies about its abilities, is annoying but does not imply that the tool is useless.

I experience LLM "reasoning" failure several times a day, yet I find them useful.


>They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.

And lo and behold, they still found a glaring failure. You can't fault them for not buying into the hype.


But it is still dishonest to declare reasoning LLMs a scam simply because you searched for a failure mode.

If given a few hundred tries, I bet I could find an example where you reason poorly too. Wikipedia has a whole list of common failure modes of human reasoning: https://en.wikipedia.org/wiki/List_of_fallacies


Well, given that the success rate is no more than 90% in the best cases, you could probably find a failure in about 10 tries. The only exception is o1-preview. And this is just a simple substitution of parameters.
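
Back-of-the-envelope, assuming independent attempts at a 90% per-attempt success rate:

  P(\text{at least one failure in } n \text{ tries}) = 1 - 0.9^{n}, \qquad 1 - 0.9^{10} \approx 0.65

  \mathbb{E}[\text{tries until first failure}] = \frac{1}{1 - 0.9} = 10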


The other day I fed a complicated engineering doc for an architectural proposal at work into R1. I incorporated a few great suggestions into my work. Then my work got reviewed very positively by a large team of senior/staff+ engineers (most with experience at FAANG; ie credibly solid engineers). R1 was really useful! Sorry you don’t like it but I think it’s unfair to say it sucks at reasoning.


Your argument is exactly the kind which makes me think people who claim LLMs are intelligent are trolling.

You are equating things which are not related and do not follow from each other. For example:

- A tool being useful (for particular people and particular tasks) does not mean it is reasoning. A static type checker is pretty fucking useful but is neither intelligent nor reasoning.

- The OP did not say he doesn't like R1, he said he disagrees with the opinion it can reason and with how the company advertises the model.

The fake "sorry" is a form of insult and manipulation.

There are probably more issues with your comment but I am unwilling to invest any more time into arguing with someone unwilling to use reasoning to understand text.


Please don't cross into personal attack and please don't post in the flamewar style, regardless of how wrong someone is or you feel they are. We're trying for the opposite here.

https://news.ycombinator.com/newsguidelines.html


The issue with this approach to moderation is that it targets posts based on visibility of "undesired" behavior instead of severity.

For example, many manipulative tactics (e.g. the fake sorry here, responding to something other than what was said, ...) and lying can be considered insults (they literally assume the reader is not smart enough to notice, hence at least as severe as calling someone an idiot), but it's hard for a mod to notice without putting in a lot of effort to understand the situation.

Yet when people (very mildly) punish this behavior by calling it out, they are often noticed by the mod because the call out is more visible.


I hear this argument a lot, but I think it's too complicated. It doesn't explain any more than the simple one does, and has the disadvantage of being self-serving.

The simple argument is that when you write things like this:

> I am unwilling to invest any more time into arguing with someone unwilling to use reasoning

...you're bluntly breaking the rules, regardless of what another commenter is doing, be it subtly or blatantly abusive.

I agree that there are countless varieties of passive-aggressive swipe and they rub me the wrong way too, but the argument that those are "just as bad, merely less visible" is not accurate. Attacking someone else is not justified by a passive-aggressive "sorry", just as it is not ok to ram another vehicle when a driver cuts you off in traffic.


I've thought about this a lot because in the past few years I've noticed a massive uptick in what I call "fake politeness" or "polite insults" - people attacking somebody but taking care to stay below the threshold of when a mod would take action, instead hoping that the other person crosses the threshold. This extends to the real world too - you can easily find videos of people and groups (often protesters and political activists) arguing, insulting each other (covertly and overtly) and hoping the other side crosses a threshold so they can play the victim and get a higher power involved.

The issue is many rules are written as absolute statements which expect some kind of higher power (mods, police, ...) to be the only side to deal punishment. This obviously breaks in many situations - when the higher power is understaffed, when it's corrupt or when there is no higher power (war between nation states).

I would like to see attempts to make rules relative. Treat others how you want to be treated but somebody treating you badly gives you the right to also treat them badly (within reason - proportionally). It would probably lead to conflict being more visible (though not necessarily being more numerous) but it would allow communities to self-police without the learned helplessness of relying on a higher power. Aggressors would gain nothing by provoking others because others would be able to defend themselves.

Doing this is hard, especially at scale. Many people who behave poorly towards others back off when they are treated the same way but there also needs to be a way to deal with those who never back down. When a conflict doesn't resolve itself and mods step in, they should always take into account who started it, and especially if they have a pattern of starting conflict.

There's another related issue - there is a difference between fairness/justice and peace. Those in power often fight for the first on paper but have a much stronger incentive to protect the second.


> people attacking somebody but taking care to stay below the threshold of when a mod would take action, instead hoping that the other person crosses the threshold

I agree, it is a problem—but it is (almost by definition) less of a problem than aggression which does cross the threshold. If every user would give up being overtly abusive for being covertly abusive, that wouldn't be great—but it would be better, not least because we could then raise the bar to make that also unacceptable.

(I'm not sure this analogy is helpful, but to me it's comparable to the difference between physical violence and emotional abuse. Both are bad, but society can't treat them the same way, and that despite the fact that emotional abuse can actually be worse in some situations.)

> somebody treating you badly gives you the right to also treat them badly (within reason - proportionally)

I can tell you why that doesn't work (at least not in a context like HN where my experience is): because everyone overestimates the provocations and abuses done by the other, and underestimates the ones done by themselves. If you say the distortion is 10x in each case, that's a 100x skew in perception [1]

As a result, no matter how badly people are behaving, they always feel like the other person started it and did worse, and always feel justified.

In other words, to have that as a rule would amount to having no rule. In order to be even weakly effective, the rule needs to be: you can't be abusive in comments regardless of what other commenters are doing or you feel they are doing [2].

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


> it is (almost by definition) less of a problem than aggression which does cross the threshold

Unless you also take into account scale (how often the person does it or how many other people do it) and second-order effects (people who fall for the manipulation and spread it further or act on it). For this reason, I very much prefer people who insult me honestly and overtly: at least I know where I stand with them, and other people are less likely to be influenced by them.

> I'm not sure this analogy is helpful

This is actually a very rare occasion when an analogy is helpful. As you point out, the emotional abuse can (often?) be worse. TBH, when it "escalates" to being physical, it often is a good thing, because it finally 1) gives the target/victim "permission" to ask for help, 2) makes it visible to casual observers, increasing the likelihood of intervention, and 3) can leave physical evidence and is easily spotted by witnesses.

(I witnessed a whole bunch of bullying and attempts at bullying at school, and one thing that remained constant is that people who fought back (retaliated) were left alone (eventually). It is also an age where physical violence is acceptable and serious injuries are rare (actually I don't recall a single one from fighting). This is why I always encourage people to fight back: not only is it effective but it teaches them individual agency instead of waiting for someone in a position of power to save them.)

> I can tell you why that doesn't work

I appreciate this datapoint (and the fact you are open to discussing it, unlike many mods). I agree that it's often hard to distinguish between mistake and malice. For example I reacted to the individual instance because of similar comments I ran into in the past but I didn't check if the same person is making fallacious arguments regularly or if it was a one-off.

But I also have experiences with good outcomes. One example stands out: a guy used a fallacy when arguing with me, I asked him to not do that, he did it again, so I did it twice to him as well _while explaining why I am doing it_. He got angry at first, trying to call me out for doing something I told him not to do, but when I asked him to read it again and pointed out that the justification was right after my message with the fallacy (not post-hoc after being "called out"), he understood and stopped doing it himself. It was as if he wasn't really reading my messages at first, but reversing the situation made him pay actual attention.

I think the key is that it was a small enough community that 1) the same people interacted with each other repeatedly and that 2) I explained the justification as part of the retaliation.

Point 1 will never be possible at the scale of HN, though I would like to see algorithmic approaches to truth and trust instead of upvotes/downvotes, which just boil down to agree/disagree. Point 2 can be applied anywhere, and if mods decide to step in, it IMO is something they should take into account.

Anyway, thanks for the links. I don't have time to go through other people's arguments rn, but I will save them for later; it is good to know this comes up from time to time and that I am not completely crazy when I see something wrong with the standard threshold-based approach.

Oh, and you didn't say it explicitly, but I feel like you understand the difference between rules and right/wrong given your phrasing. That is a very nice thing to see if I am correct (though I have no doubt your phrasing was refined by years of trial and error as to what is effective). In general, I believe it should always be made clear that rules exist for practical reasons, rather than pretending they are some kind of codification of morality.


Just a quick response to that last point: I totally agree. HN's guidelines are not a moral code. They're just heuristics for (hopefully) producing the type of website we want HN to be.

Another way of putting it is that the rules aren't moral or ethical—they're just the rules of the game we're trying to play here. Different games naturally have different rules.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


How do I know you're reasoning, and not just simulating reasoning (imperfectly)?


"researchers seek to leverage their human knowledge of the ___domain, but the only thing that matters in the long run is the leveraging of computation" - Rich Sutton


This is basically a misrepresentation of that tweet.



