I suppose this was their final hurrah after two failed attempts at training GPT-...

crystal_revenge · 2025-02-28T01:10:46 1740705046

> Just confirms reasoning models are the only way forward.

Reasoning models are roughly the equivalent to allow Hamiltonian Monte-Carlo models to "warm up" (i.e. start sampling from the typical set). This, unsurprisingly, yields better results (after all LLMs are just fancy Monte-carlo models in the end). However, it is extremely unlikely this improvement is without pretty reasonable limitations. Letting your HMC warm up is essential to good sampling, but letting "warm up more" doesn't result in radically better sampling.

While there have been impressive results in efficiency of sampling from the typical set seen in LLMs these days, we're clearly not making the major improvements in the capabilities of these models.

int_19h · 2025-02-28T10:04:08 1740737048

Reasoning models can solve tasks that non-reasoning ones were unable to; how is that not an improvement? What constitutes "major" is subjective - if a "minor" improvement in overall performance means that the model can now successfully perform a task it was unable to solve before, that is a major advancement for that particular task.

granzymes · 2025-02-27T20:38:28 1740688708

> Compared to OpenAI o1 and OpenAI o3‑mini, GPT‑4.5 is a more general-purpose, innately smarter model. We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other. As models like GPT‑4.5 become smarter and more knowledgeable through pre-training, they will serve as an even stronger foundation for reasoning and tool-using agents.

DebtDeflation · 2025-02-27T21:28:31 1740691711

GPT 5 is likely just going to be a router model that decides whether to send the prompt to 4o, 4o mini, 4.5, o3, or o3 mini.

swores · 2025-02-27T23:42:46 1740699766

My guess is that you're right about that being what's next (or maybe almost next) from them, but I think they'll save the name GPT-5 for the next actually-trained model (like 4.5 but a bigger jump), and use a different kind of name for the routing model.

Even by their poor standards at naming it would be weird to introduce a completely new type/concept, that can loop in models including the 4 / 4.5 series, while naming it part of that same series.

My bet: probably something weird like "oo1", or I suspect they might try to give it a name that sticks for people to think of as "the" model - either just calling it "ChatGPT", or coming up with something new that sounds more like a product name than a version number (OpenCore, or Central, or... whatever they think of)

JohnnyMarcone · 2025-02-28T02:29:00 1740709740

They already confirmed GPT-5 will be a unified model "months" away. Elsewhere they claimed that it will not just be a router but a "unified" model.

https://www.theverge.com/news/611365/openai-gpt-4-5-roadmap-...

DebtDeflation · 2025-02-28T09:30:56 1740735056

If you read what sama is quoted as saying in your link, it's obvious that "unified model" = router.

> “We hate the model picker as much as you do and want to return to magic unified intelligence,”

> “a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks,”

> the company plans to “release GPT-5 as a system that integrates a lot of our technology, including o3,”

He even slips up and says "integrates" in the last quote.

When he talks about "unifying", he's talking about the user experience not the underlying model itself.

swores · 2025-02-28T10:16:06 1740737766

Interesting, thanks for sharing - definitely makes me withdraw my confidence in that prediction, though I still think there's a decent chance they change their mind about that as it seems to me like an even worse naming decision than their previous shit name choices!

lolinder · 2025-02-28T00:54:48 1740704088

Except minus 4.5, because at these prices and results there's essentially no reason not to just use one of the existing models if you're going to be dynamically routing anyway.

jstummbillig · 2025-02-27T20:53:32 1740689612

What it confirms, I think, is, that we are going to need a lot more chips.

georgemcbay · 2025-02-27T22:44:21 1740696261

Further confirmation, IMO, that the idea that any of this leads to anything close to AGI is people getting high on their own supply (in some cases literally).

LLMs are a great tool for what is effectively collected knowledge search and summary (so long as you are willing to accept that you have to verify all of the 'knowledge' they spit back because they always have the ability to go off the rails) but they have been hitting the limits on how much better that can get without somehow introducing more real knowledge for close to 2 years now and everything since then is super incremental and IME mostly just benchmark gains and hype as opposed to actually being purely better.

I personally don't believe that more GPUs solves this, like, at all. But its great for Nvidia's stock price.

BobbyJo · 2025-02-28T00:27:02 1740702422

I'd put myself on the pessimistic side of all the hype, but I still acknowledge that where we are now is a pretty staggering leap from two years ago. Coding in particular has gone from hints and fragments to full scripts that you can correct verbally and are very often accurate and reliable.

georgemcbay · 2025-02-28T01:12:34 1740705154

I'm not saying there's been no improvement at all. I personally wouldn't categorize it as staggering, but we can agree to disagree on that.

I find the improvements to be uneven in the sense that every time I try a new model I can find use cases where its an improvement over previous versions but I can also find use cases where it feels like a serious regression.

Our differences in how we categorize the amount of improvement over the past 2 years may be related to how much the newer models are improving vs regressing for our individual use cases.

When used as coding helpers/time accelerators, I find newer models to be better at one-shot tasks where you let the LLM loose to write or rewrite entire large systems and I find them worse at creating or maintaining small modules to fit into an existing larger system. My own use of LLMs is largely in the latter category.

To be fair I find the current peak model for coding assistant to be Claude 3.5 Sonnet which is much newer than 2 years old, but I feel like the improvements to get to that model were pretty incremental relative to the vast amount of resources poured into it and then I feel like Claude 3.7 was a pretty big back-slide for my own use case which has recently heightened my own skepticism.

infecto · 2025-02-28T12:25:40 1740745540

Hilarious. Over two years we went from LLMs being slow and not very capable of solving problems to models that are incredibly fast, cheap and able to solve problems in different domains.

pseufaux · 2025-02-28T00:06:49 1740701209

Well said. 100% agree

prisenco · 2025-02-27T21:11:14 1740690674

Or, possibly, we're stuck waiting for another theoretical breakthrough before real progress is made.

resource0x · 2025-02-27T21:29:42 1740691782

breakthrough in biology

DannyBee · 2025-02-27T22:55:11 1740696911

Eh, no. More chips won't save this right now, or probably in the near future (IE barring someone sitting on a breakthrough right now).

It just means either

A. Lots and lots of hard work that get you a few percent at a time, but add up to a lot over time.

or

B. Completely different approaches that people actually think about for a while rather than trying to incrementally get something done in the next 1-2 months.

Most fields go through this stage. Sometimes more than once as they mature and loop back around :)

Right now, AI seems bad at doing either - at least, from the outside of most of these companies, and watching open source/etc.

While lots of little improvements seem to be released in lots of parts, it's rare to see anywhere that is collecting and aggregating them en masse and putting them in practice. It feels like for every 100 research papers, maybe 1 makes it into something in a way that anyone ends up using it by default.

This could be because they aren't really even a few percent (which would be yet a different problem, and in some ways worse), or it could be because nobody has cared to, or ...

I'm sure very large companies are doing a fairly reasonable job on this, because they historically do, but everyone else - even frameworks - it's still in the "here's a million knobs and things that may or may not help".

It's like if compilers had no "O0/O1/O2/O3' at all and were just like "here's 16,283 compiler passes - you can put them in any order and amount you want". Thanks! I hate it!

It's worse even because it's like this at every layer of the stack, whereas in this compiler example, it's just one layer.

At the rate of claimed improvements by papers in all parts of the stack, either lots and lots and lots is being lost because this is happening, in which case, eventually that percent adds up to enough for someone to be able to use to kill you, or nothing is being lost, in which case, people appear to be wasting untold amounts of time and energy, then trying to bullshit everyone else, and the field as a whole appears to be doing nothing about it. That seems, in a lot of ways, even worse. FWIW - I already know which one the cynics of HN believe, you don't have to tell me :P. This is obviously also presented as black and white, but the in-betweens don't seem much better.

Additionally, everyone seems to rush half-baked things to try to get the next incremental improvement released and out the door because they think it will help them stay "sticky" or whatever. History does not suggest this is a good plan and even if it was a good plan in theory, it's pretty hard to lock people in with what exists right now. There isn't enough anyone cares about and rushing out half-baked crap is not helping that. mindshare doesn't really matter if no one cares about using your product.

Does anyone using these things truly feel locked into anyone's ecosystem at this point? Do they feel like they will be soon?

I haven't met anyone who feels that way, even in corps spending tons and tons of money with these providers.

The public companies - i can at least understand given the fickleness of public markets. That was supposed to be one of the serious benefit of staying private. So watching private companies do the same thing - it's just sort of mind-boggling.

Hopefully they'll grow up soon, or someone who takes their time and does it right during one of the lulls will come and eat all of their lunches.

gniv · 2025-02-28T05:49:37 1740721777

> Completely different approaches that people actually think about for a while

I think this is very likely simply because there are so many smart people looking at it right now. I hope the bubble doesn't burst before it happens.

usaar333 · 2025-02-27T21:19:51 1740691191

For OpenAI perhaps? Sonnet 3.7 without extended thinking is quite strong. Swe-bench scores tie o3

stavros · 2025-02-27T23:02:27 1740697347

How do you read those scores? I wanted to see how well 3.7 with thinking did, but I can't even read that table.

newfocogi · 2025-02-27T20:34:34 1740688474

I think this is the correct take. There are other axes to scale on AND I expect we'll see smaller and smaller models approach this level of pre-trained performance. But I believe massive pre-training gains have hit clearly diminished returns (until I see evidence otherwise).