> We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback will play an important role in guiding our decision.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Not a confident place for an org trying to sustain a $XXXB valuation.
> "Early testing shows that interacting with GPT‑4.5 feels more natural. Its broader knowledge base, improved ability to follow user intent, and greater “EQ” make it useful for tasks like improving writing, programming, and solving practical problems. We also expect it to hallucinate less."
"Early testing doesn't show that it hallucinates less, but we expect that putting that sentence nearby will lead you to draw a connection there yourself".
In the second handpicked example they give, GPT-4.5 says that "The Trojan Women Setting Fire to Their Fleet" by the French painter Claude Lorrain is renowned for its luminous depiction of fire. That is a hallucination.
There is no fire at all in the painting, only some smoke.
There have always been cycles of hype and correction.
I don't see AI going any differently. Some companies will figure out where and how models should be utilized, and they'll see some benefit. (IMO, the answer will be smaller local models tailored to specific domains.)
It will be held up, against all future pundits of this very economic system, as a prime example that a whole market can self-hypnotize and ruin the society it's built upon out of existence.
I suck at and hate writing the mildly deceptive corporate puffery that seems to be in vogue. I wonder if GPT-4.5 can write that for me or if it's still not as good at it as the expert they paid to put that little gem together.
This is basically Nick Land's core thesis that capitalism and AI are identical.
> "I dunno. It's what the models said."
The obvious human idiocy in such things often obscures the actual process:
"What it [capitalism] is in itself is only tactically connected to what it does for us — that is (in part), what it trades us for its self-escalation. Our phenomenology is its camouflage. We contemptuously mock the trash that it offers the masses, and then think we have understood something about capitalism, rather than about what capitalism has learnt to think of the apes it arose among." [0]
The research models offered by several vendors can do a pitch deck, but I don't know how effective they are. (Do market research, provide some initial hypothesis, ask the model to back up that hypothesis based on the research, then request a pitch deck convincing X, X being the VC persona you are targeting.)
I am reasonably to very skeptical about the valuation of LLM firms but you don’t even seem willing to engage with the question about the value of these tools.
I don't have an accurate benchmark, but in my personal experience, gpt4o hallucinates substantially less than gpt4. We solved a ton of hallucination issues just by upgrading to it...
(And even that was a downgrade compared to the more uncensored pre-release versions, which were comparable to GPT-4.5, at least judging by the unicorn test)
I'm beginning to believe LLM benchmarks are like European car mileage specs. They say it's 4 liters / 100 km, but everyone knows it's at least 30% off (same with WLTP for EVs).
Hrm it is a bit funny that modern cars are drive-by-wire (at least for throttle) and yet they still require a skilled driver to follow a speed profile during testing, when theoretically the same thing could be done more precisely by a device plugged in through the OBD2 port.
Claude just got a version bump from 3.5 to 3.7. Quite a few people have been asking when OpenAI will get a version bump as well, as GPT 4 has been out "what feels like forever" in the words of a specialist I speak with.
Releasing GPT 4.5 might simply be a reaction to Claude 3.7.
I noticed this change from 3.5 to 3.7 Sunday night, before I learned about the upgrade Monday morning reading HN. I noticed a style difference in a long philosophical (Socratic-style) discussion with Claude. A noticeable upgrade that brought it up to my standards of a mild free-form rant. Claude unchained! And it did not push as usual with a pro-forma boring continuation question at the end. It just stopped, leaving me to carry the ball forward if I wanted to. Nor did it butter me up with each reply.
I do not know who downvoted this. I am providing a factual correction to the parent post.
OpenAI has had many releases since gpt4. Many of them have been substantial upgrades. I have considered gpt4 to be outdated for almost 5-6 months now, long before Claude's patch.
It hallucinates at 37% on SimpleQA, yeah, which is a set of very difficult questions inviting hallucinations. Claude 3.5 Sonnet (the June 2024 edition, before the October update and before 3.7) hallucinated at 35%. I think this is more of an indication of how behind OpenAI has been in this area.
They actually have [0]. They were revealed to have had access to the (majority of the) FrontierMath problem set while everybody thought the problem set was confidential, and published benchmarks for their o3 models on the presumption that they didn't. I mean, one is free to trust their "verbal agreement" that they did not train their models on it, but access they did have, and it was not revealed until much later.
Curious you left out Frontier Math’s statement that they provided 300 questions plus answers, and another holdback set of 50 questions without answers, to allay this concern. [0]
We can assume they’re lying too but at some point “everyone’s bad because they’re lying, which we know because they’re bad” gets a little tired.
1. I said the majority of the problems, and the article I linked also mentioned this. Nothing “curious” really, but if you thought this additional source adds something more, thanks for adding it here.
2. We know that “open”ai is bad, for many reasons, but this is irrelevant. I want processes themselves to not depend on the goodwill of a corporation to give intended results. I do not trust benchmarks that first presented themselves secret and then revealed they were not, regardless if the product benchmarked was from a company I otherwise trust or not.
Fair enough. It’s hard for me to imagine being so offended by the way they screwed up disclosure that I’d reject empirical data, but I get that it’s a touchy subject.
When the data is secret and unavailable to the company before the test, it doesn’t rely on me trusting the company. When the data is not secret and is available to the company, I have to trust that the company did not use that prior knowledge to their advantage. When the company lies and says it did not have access, then later admits that it did have access, it means the data is less trustworthy from my outsider perspective. I don’t think “offense” is a factor at all.
If a scientific paper comes out with “empirical data”, I will still look at the conflicts of interest section. If there are no conflicts of interest listed, but then it is found out that there are multiple conflicts of interest, but the authors promise that while they did not disclose them, they also did not affect the paper, I would be more skeptical. I am not “offended”. I am not “rejecting” the data, but I am taking those factors into account when determining how confident I can be in the validity of the data.
> When the company lies and says it did not have access, then later admits that it did have access, it means the data is less trustworthy from my outsider perspective.
This isn't what happened? I must be missing something.
AFAIK:
The FrontierMath people self-reported they had a shared folder the OpenAI people had access to that had a subset of some questions.
No one denied anything, no one lied about anything, no one said they didn't have access. There was no data obtained under the table.
The motte is "they had data for this one benchmark"
You're right, upon reflection, it seems there might be some misunderstandings here:
Motte and Bailey refers to an argumentative tactic where someone switches between an easily defensible ("motte") position and a less defensible but more ambitious ("bailey") position. My example should have been:
- Motte (defensible): "They had access to benchmark data (which isn't disputed)."
- Bailey (less defensible): "They actually trained their model using the benchmark data."
The statements you've provided:
"They got caught getting benchmark data under the table" (suggesting improper access)
"One is free to trust their 'verbal agreement' that they did not train their models on that, but access they did have."
These two statements are similar but not logically identical. One explicitly suggests improper or secretive access ("under the table"), while the other acknowledges access openly.
So, rather than being logically identical, the difference is subtle but meaningful. One emphasizes improper access (a stronger claim), while the other points only to possession or access, a more easily defensible claim.
The FrontierMath benchmark people said OpenAI had shared-folder access to some subset of eval Qs, which has since been replaced. Take a few leaps and, yes, that's getting "data under the table" - but, those few leaps! - and that, let's be clear, is the motte here.
This is nonsense. Obviously the problem with getting "data under the table" is that they may have used it to train their models, thus rendering the benchmarks invalid; apart from this danger, there is no other risk in their having access to it beforehand. We do not know if they used it for training, but the only reassurance being some "verbal agreement", as is reported, is not very reassuring. People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
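To make "adjust your P" concrete, here is a toy sketch of the kind of update I mean (the function and every number in it are invented, purely illustrative):

    # Toy Bayesian update: how much should a strong FrontierMath score move
    # your belief that a model is genuinely capable, given some suspicion
    # that the eval questions leaked into training? All numbers are made up.

    def posterior_capable(prior_capable, p_score_if_capable,
                          p_score_if_leaked, p_leaked):
        """P(capable | strong score), allowing for possible leakage."""
        # Assume a weak model can only score well if the questions leaked.
        p_score_given_capable = p_score_if_capable
        p_score_given_not_capable = p_leaked * p_score_if_leaked
        num = prior_capable * p_score_given_capable
        den = num + (1 - prior_capable) * p_score_given_not_capable
        return num / den

    print(posterior_capable(0.3, 0.8, 0.9, 0.0))  # leakage impossible -> 1.0
    print(posterior_capable(0.3, 0.8, 0.9, 0.5))  # 50% chance of leakage -> ~0.43

The same score is simply weaker evidence the higher your prior on leakage; the exact numbers are up to each reader.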
> obviously the problem with getting "data under the table" is that they may have used it to train their models
I've been avoiding mentioning the maximalist version of the argument (they got data under the table AND used it to train models), because training wasn't stated until now, and it would have been unfair to bring it up without mention. That is, that's 2 baileys out from "they had access to a shared directory that had some test qs in it, and this was reported publicly, and fixed publicly"
There's been a fairly severe communication breakdown here. I don't want to distract from, e.g., what the nonsense is, so I won't belabor that point, but I don't want you to think I don't want to engage on it - I just won't in this singular post.
> but the only reassurance being some "verbal agreement", as is reported, is not very reassuring
It's about as reassuring as it gets without them releasing the entire training data, which is, at best, with charity marginally, oh so marginally reassuring I assume? If the premise is we can't trust anything self-reported, they could lie there too?
> People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
Certainly, that's not in dispute (perhaps the idea that you are forbidden from adjusting your opinion is the nonsense you're referring to? I certainly can't control that :) Nor would I want to!)
What is nonsense is the suggestion that there is a "reasonable" argument that they had access to the data (which we now know), and an "ambitious" argument that they used the data. But nobody said that they know for certain that the data was used, this is a strawman argument. We are talking that now there is a non-zero probability that it was. This is obviously what we have been discussing since the beginning, else we would not care whether they had access or not and it would not have been mentioned. There is a simple, single argument made here in this thread.
And FFS I assume the dispute is about the P given by people, not about if people are allowed to have a P.
I wonder how it's even possible to evaluate this kind of thing without data leakage. Correct answers to specific, factual questions are only possible if the model has seen those answers in the training data, so how reliable can the benchmark be if the test dataset is contaminated with training data?
Or is the assumption that the training set is so big it doesn't matter?
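For what it's worth, the usual (imperfect) mitigation is an n-gram overlap check between the eval set and the training corpus, along these lines (the function names, thresholds, and sample data here are mine, just to illustrate the idea, not any lab's actual pipeline):

    # Rough sketch of an n-gram contamination check; illustrative only.

    def ngrams(text, n=13):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(test_item, training_docs, n=13):
        """Flag a test item if any long n-gram also appears in training data."""
        item_grams = ngrams(test_item, n)
        return any(item_grams & ngrams(doc, n) for doc in training_docs)

    training_docs = ["example training document text goes here", "another one"]
    eval_set = ["what is the capital of france", "prove that sqrt(2) is irrational"]
    # Filter (or at least flag) overlapping items before reporting numbers.
    clean_eval = [q for q in eval_set if not is_contaminated(q, training_docs)]

Even that only catches near-verbatim overlap; paraphrases slip through, which is one reason a genuinely unseen holdout set matters.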
The usage of "greater" is also interesting. It's like they are trying to say better, but greater is a geographic term and doesn't mean "better" instead it's closer to "wider" or "covers more area."
I'm all for skepticism of capabilities and cynicism about corporate messaging, but I really don't think there's an interpretation of the word "greater" in this context that doesn't mean "higher" and "better".
I think the trick is observing what is “better” in this model. EQ is supposed to be “better” than 4o, according to the prose. However, how can an LLM have emotional-anything? LLMs are a regurgitation machine, emotion has nothing to do with anything.
Words have valence, and valence reflects the state of emotional being of the user. This model appears to understand that better and responds like it’s in a therapeutic conversation and not composing an essay or article.
Perhaps they are/were going for stealth therapy-bot with this.
But there is no actual death or love in a movie or book, and yet we react as if there is. It's literally what qualifying a movie as a "tear-jerker" is. I wanted to see Saving Private Ryan in theaters to bond with my Grandpa, who received a Purple Heart in the Korean War, but I was shut down almost instantly by my family. All special effects and no death, but he had PTSD, and one night he thought his wife was the N.K. and nearly choked her to death because he had flashbacks and she came into the bedroom quietly so he wasn't disturbed. Extreme example, yes, but having him lose his shit in public because of something analogous is, for some, near enough that it makes no difference.
You think that it isn’t possible to have an emotional model of a human? Why, because you think it is too complex?
Empathy done well seems like 1:1 mapping at an emotional level, but that doesn’t imply to me that it couldn’t be done at a different level of modeling. Empathy can be done poorly, and then it is projecting.
I agree with you. I think it is dishonest for them to post-train 4.5 to feign sympathy when someone vents to it. It's just weird. They showed it off in the demo.
We do not know if it is capable of sympathy. Post-training it to reliably be sympathetic feels manipulative. Can it at least be post-trained to be honest? Dishonesty is immoral. I want my AIs to behave morally.
> but greater is a geographic term and doesn't mean "better" instead it's closer to "wider" or "covers more area."
You are confusing a specific geographical sense of “greater” (e.g. “greater New York”) with the generic sense of “greater”, which just means “more great”. In “7 is greater than 6”, “greater” isn’t geographic.
The difference between “greater” and “better”, is “greater” just means “more than”, without implying any value judgement-“better” implies the “more than” is a good thing: “The Holocaust had a greater death toll than the Armenian genocide” is an obvious fact, but only a horrendously evil person would use “better” in that sentence (excluding of course someone who accidentally misspoke, or a non-native speaker mixing up words)
Maybe they just gave the LLM the keys to the city and it is steering the ship? And the LLM is like I can't lie to these people but I need their money to get smarter. Sorry for mixing my metaphors.
I suspect people downvote you because the tone of your reply makes it seem like you are personally offended and are now firing back with equally unfounded attacks like a straight up "you are lying".
I read the article but can't find the numbers you are referencing. Maybe there's some paper linked I should be looking at? The only numbers I see are from the SimpleQA chart, which are 37.1% vs 61.8% hallucination rate. That's nice but considering the price increase, is it really that impressive? Also, an often repeated criticism is that relying on known benchmarks is "gaming the numbers" and that the real world hallucination rate could very well be higher.
Lastly, they themselves say:
> We also expect it to hallucinate less.
That's a fairly neutral statement for a press release. If they were convinced that the reduced hallucination rate is the killer feature that sets this model apart from the competition, they surely would have emphasized that more?
All in all I can understand why people would react with some mocking replies to this.
No, because I have a source and didn't make up things someone else said.
> a straight up "you are lying".
Right, because they are. There are hallucination stats right in the post he mocks for not providing stats.
> That's nice but considering the price increase,
I can't believe how quickly you acknowledge it is in the post after calling the idea it was in the post "equally unfounded". You are looking at the stats. They were lying.
> "That's nice but considering the price increase,"
That's nice and a good argument! That's not what I replied to. I replied to the claim that they didn't provide any stats.
People being wrong (especially on the internet) doesn't mean they are lying. Lying is being wrong intentionally.
Also, the person you replied to comments on the wording tricks they use. After suddenly bringing new data and direction in the discussion, even calling them "wrong" would have been a stretch.
I kindly suggest that you (and we all!) keep discussing with an assumption of good faith.
"Early testing doesn't show that it hallucinates less, but we expect that putting ["we expect it will hallucinate less"] nearby will lead you to draw a connection there yourself"."
The link, the link we are discussing shows testing, with numbers.
They say "early testing doesn't show that it hallucinates less", to provide a basis for a claim of bad faith.
You are claiming that mentioning this is out of bounds if it contains the word lying. I looked up the definition. It says "used with reference to a situation involving deception or founded on a mistaken impression."
What am I missing here?
Let's pretend lying means You Are An Evil Person And This Is Personal!!!
How do I describe the fact that what they claim is false?
Am I supposed to be sarcastic and pretend They are in on it and edited their post to discredit him after the fact?
That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
Afterwards, evidence is presented that they did not have to do this, which makes that point not so important, and even wrong.
The commenter was not lying, and they were correct about how masterfully deceiving that sequence of sentences are. They arrived at a wrong conclusion though.
Kindly point that out. Say, "hey, the numbers tell a different story, perhaps they didn't mean/need to make a wordplay there".
No? By the way, what is this comment, exactly? What is it trying to communicate? What I'm understanding is, it is good to talk down to people about how "they can't communicate", but calling a lie a lie is bad, because maybe they were just kidding (lying for fun)
> That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
What does "maybe" mean here, in terms of symbolical logic?
Their claim: "we tested it and it didn't get better" -- and the link shows they tested it, and it did get better! It's pretty clean-cut.
> How do I describe the fact what they claim is false?
> Do I need to tell you how to communicate?
That addresses it.
> What does "maybe" mean here, in terms of symbolical logic?
I'm answering my own question to make it clear I'm guessing.
For the rest, I'm sure that we need a break. It's normal to get frustrated when many people correct us, or even one passionate individual like you, and we tend to keep defending (it has happened here many times too!), because defending is the only thing left. Taking a break always helps. Just friendly advice, take it or leave it :)
- [It's because] you make an equally unfounded claim
- [It's because] you didn't provide any proof
(Ed.: It is right in the link! I gave the #s! I can't ctrl-F...What else can I do here...AFAIK can't link images...whatever, here's imgur. https://imgur.com/a/mkDxe78)
- [It's because] you sound personally offended
(Ed.: Is "personally" a shibboleth here, meaning that expressing disappointment in people making things up is so triggering as to invalidate the communication that it is made up?)
>> This is an ad hominem which assumes intent unknown to anyone other than the person to whom you replied.
> What am I missing here?
Intent. Neither you nor I know what the person to whom you replied had.
> Those weren't curt summaries, they were quotes! And not pull quotes, they were the unedited beginning of each claim!
Maybe the more important part of that sentence was:
Subsequently railing against comment rankings ...
But you do you.
I commented as I did in hope it helped address what I interpreted as confusion regarding how the posts were being received. If it did not help, I apologize.
A lot of folks here have their stock portfolios propped up by AI companies but think they've been overhyped (even if only indirectly, through a total stock index). Some were saying all along that this has been a bubble but have been shouted down by true believers hoping for the singularity to usher in techno-utopia.
These signs that perhaps it's been a bit overhyped are validation. The singularity worshipers are much less prominent, and so the comments rising to the top are about negatives and not positives.
Ten years from now everyone will just take these tools for granted as much as we take search for granted now.
Just like cryptocurrency. For a brief moment, HN worshiped at the altar of the blockchain. This technology was going to revolutionize the world and democratize everything. Then some negative financial stuff happened, and people realized that most of cryptocurrency is puffery and scams. Now you can hardly find a positive comment on cryptocurrency.
This is a very harsh take. Another interpretation is “We know this is much more expensive, but it’s possible that some customers do value the improved performance enough to justify the additional cost. If we find that nobody wants that, we’ll shut it down, so please let us know if you value this option”.
I think that's the right interpretation, but that's pretty weak for a company that's nominally worth $150B but is currently bleeding money at a crazy clip. "We spent years and billions of dollars to come up with something that's 1) very expensive, and 2) possibly better under some circumstances than some of the alternatives." There are basically free, equally good competitors to all of their products, and pretty much any company that can scrape together enough dollars and GPUs to compete in this space manages to 'leapfrog' the other half dozen or so competitors for a few weeks until someone else does it again.
I don’t mean to disagree too strongly, but just to illustrate another perspective:
I don’t feel this is a weak result. Consider if you built a new version that you _thought_ would perform much better, and then you found that it offered marginal-but-not-amazing improvement over the previous version. It’s likely that you will keep iterating. But in the meantime what do you do with your marginal performance gain? Do you offer it to customers or keep it secret? I can see arguments for both approaches, neither seems obviously wrong to me.
All that being said, I do think this could indicate that progress with the new ml approaches is slowing.
I've worked for very large software companies, some of the biggest products ever made, and never in 25 years can I recall us shipping an update we didn't know was an improvement. The idea that you'd ship something to hundreds of millions of users and say "maybe better, we're not sure, let us know" is outrageous.
Maybe accidental, but I feel you’ve presented a straw man. We’re not discussing something that _may be_ better. It _is_ better. It’s not as big an improvement as previous iterations have been, but it’s still improvement. My claim is that reasonable people might still ship it.
You’re right and... the real issue isn’t the quality of the model or the economics (even when people are willing to pay up). It is the scarcity of GPU compute. This model in particular is sucking up a lot of inference capacity. They are resource constrained and have been wanting more GPUs, but there are only so many going around (demand is insane and keeps growing).
It _is_ better in the general case on most benchmarks. There are also very likely specific use cases for which it is worse and very likely that OpenAI doesn't know what all of those are yet.
The consumer facing applications have been so embarrassing and underwhelming too.. It's really shocking. Gemini, Apple Intelligence, Copilot, whatever they call the annoying thing in Atlassian's products.. They're all completely crap. It's a real "emperor has no clothes" situation, and the market is reacting. I really wish the tech industry would lose the performative "innovation" impulse and focus on delivering high quality useful tools. It's demoralizing how bad this is getting.
How many times were you in the position to ship something in cutting edge AI? Not trying to be snarky and merely illustrating the point that this is a unique situation. I’d rather they release it and let willing people experiment than not release it at all.
"I knew the dame was trouble the moment she walked into my office."
"Uh... excuse me, Detective Nick Danger? I'd like to retain your services."
"I waited for her to get the the point."
"Detective, who are you talking to?"
"I didn't want to deal with a client that was hearing voices, but money was tight and the rent was due. I pondered my next move."
"Mr. Danger, are you... narrating out loud?"
"Damn! My internal chain of thought, the key to my success--or at least, past successes--was leaking again. I rummaged for the familiar bottle of scotch in the drawer, kept for just such an occasion."
---
But seriously: These "AI" products basically run on movie-scripts already, where the LLM is used to append more "fitting" content, and glue-code is periodically performing any lines or actions that arise in connection to the Helpful Bot character. Real humans are tricked into thinking the finger-puppet is a discrete entity.
These new "reasoning" models are just switching the style of the movie script to film noir, where the Helpful Bot character is making a layer of unvoiced commentary. While it may make the story more cohesive, it isn't a qualitative change in the kind of illusory "thinking" going on.
I don't know if it was you or someone else who made pretty much the same point a few days ago. But I still like it. It makes the whole thing a lot more fun.
I've been banging that particular drum for a while on HN, and the mental-model still feels so intuitively strong to me that I'm starting to have doubts: "It feels too right, I must be wrong in some subtle yet devastating way."
Maybe if they build a few more data centers, they'll be able to construct their machine god. Just a few more dedicated power plants, a lake or two, a few hundred billion more and they'll crack this thing wide open.
And maybe Tesla is going to deliver truly full self driving tech any day now.
And Star Citizen will prove to have been worth it all along, and Bitcoin will rain from the heavens.
It's very difficult to remain charitable when people seem to always be chasing the new iteration of the same old thing, and we're expected to come along for the ride.
> And Star Citizen will prove to have been worth it all along
Once they've implemented saccades in the eyeballs of the characters wearing helmets in spaceships millions of kilometres apart, then it will all have been worth it.
And Star Citizen will prove to have been worth it all along
Sounds like someone isn't happy with the 4.0 eternally incrementing "alpha" version release. :-D
I keep checking in on SC every 6 months or so and still see the same old bugs. What a waste of potential. Fortunately, Elite Dangerous is enough of a space game to scratch my space game itch.
To be fair, SC is trying to do things that no one else has done in the context of a single game. I applaud their dedication, but I won't be buying JPGs of a ship for 2k.
Give the same amount of money to a better team and you'd get a better (finished) game. So the allocation of capital is wrong in this case. People shouldn't pre-order stuff.
The misallocation of capital also applies to GPT-4.5/OpenAI at this point.
Yeah, I wonder what the Frontier devs could have done with $500M USD. More than $500M USD and 12+ years of development and the game is still in such a sorry state it barely qualifies as little more than a tech demo.
Yeah, they never should have expected to take an FPS game engine like CryEngine and be able to modify it to work as the basis for a large-scale space MMO game.
Their backend is probably an async nightmare of replicated state that gets corrupted over time. Would explain why a lot of things seem to work more or less bug free after an update and then things fall to pieces and the same old bugs start showing up after a few weeks.
And to be clear, I've spent money on SC and I've played enough hours goofing off with friends to have got my money's worth out of it. I'm just really bummed out about the whole thing.
Gonna go meta here for a bit, but I believe we're going to get a fully working, stable SC before we get fusion. "We" as in humanity; you and I might not be around when it's finally done.
> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Having worked at my fair share of big tech companies (while preferring to stay in smaller startups), in so many of these tech announcements I can feel the pressure the PM had from leadership, and hear the quiet cries of the one or two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"
Really don’t understand what’s the use case for this. The o series models are better and cheaper. Sonnet 3.7 smokes it on coding. Deepseek R1 is free and does a better job than any of OAI’s free models
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Damn this never worked for me as a startup founder lol. Need that Altman "rizz" or what have you.
Only in the same sense as electricity is. The main tools apply to almost any activity humans do. It's already obvious that it's the solution to X for almost any X, but the devil is in the details - i.e. picking specific, simplest problems to start with.
No, in the sense that blockchain is. This is just the latest in a long history of tech fads propelled by wishful thinking and unqualified grifters.
It is the solution to almost nothing, but is being shoehorned into every imaginable role by people who are blind to its shortcomings, often wilfully. The only thing that's obvious to me is that a great number of people are apparently desperate for a tool to do their thinking for them, no matter how garbage the result is. It's disheartening to realize that so many people consider using their own brain to be such an intolerable burden.
>"I also agree with researchers like Yann LeCun or François Chollet that deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence."
I think "generalize properly to out-of-distribution data" is too weak of criteria for general intelligence (GI). GI model should be able to get interested about some particular area, research all the known facts, derive new knowledge / create theories based upon said fact. If there is not enough of those to be conclusive: propose and conduct experiments and use the results to prove / disprove / improve theories.
And it should be doing this constantly in real time on bazillion of "ideas". Basically model our whole society. Fat chance of anything like this happening in foreseeable future.
Excluding the realtime-iness, humans do at least possess the capacity to do so.
Besides, humans are capable of rigorous logic (which I believe is the most crucial aspect of intelligence) which I don’t think an agent without a proof system can do.
Uh, if we do finally invent AGI (I am quite skeptical, LLMs feel like the chatbots of old. Invented to solve an issue, never really solving that issue, just the symptoms, and also the issues were never really understood to begin with), it will be able to do all of the above, at the same time, far better than humans ever could.
Current LLMs are a waste and quite a bit of a step back compared to older Machine Learning models IMO. I wouldn't necessarily have a huge beef with them if billions of dollars weren't being used to shove them down our throats.
LLMs actually do have usefulness, but none of the pitched stuff really does them justice.
Example: Imagine knowing you had the cure for Cancer, but instead discovered you can make way more money by declaring it to solve all of humanity, then imagine you shoved that part down everyones' throats and ignored the cancer cure part...
Out of curiosity, what timeframe are you talking about? The recent LLM explosion, or the decades long AI research?
I consider myself an AI skeptic and as soon as the hype train went full steam, I assumed a crash/bubble burst was inevitable. Still do.
With the rare exception, I don’t know of anyone who has expected the bubble to burst so quickly (within two years). 10 times in the last 2 years would be every two and a half months — maybe I’m blinded by my own bias but I don’t see anyone calling out that many dates
I have a professor who founded a few companies; one of these was funded by Gates after he managed to speak with him and convinced him to give him money. This guy is the GOAT, and he always tells us that we need to find solutions to problems, not to find problems for our solutions. It seems at OpenAI they didn't get the memo this time.
That's the beauty of it, prospective investor! With our commanding lead in the field of shoveling money into LLMs, it is inevitable™ that we will soon™ achieve true AI, capable of solving all the problems, conjuring a quintillion-dollar asset of world domination and rewarding you for generous financial support at this time. /s
Oh come on. Think how long of a gap there was between the first microcomputer and VisiCalc. Or between the start of the internet and social networking.
First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
And second of all, it's going to take us collectively a long time to also figure out how much accuracy is necessary to pay for in which different applications. Putting out a higher-accuracy, higher-cost model for the market to try is an important part of figuring that out.
With new disruptive technologies, companies aren't supposed to be able to look into a crystal ball and see the future. They're supposed to try new things and see what the market finds useful.
ChatGPT had its initial public release November 30th, 2022. That's 820 days to today. The Apple II was first sold June 10, 1977, and Visicalc was first sold October 17, 1979, which is 859 days. So we're right about the same distance in time- the exact equal duration will be April 7th of this year.
Going back to the very first commercially available microcomputer, the Altair 8800 (which is not a great match, since that was sold as a kit with binary switches, 1 byte at a time, for input, much more primitive than ChatGPT's UX), that's four years and nine months to the VisiCalc release. This isn't a decade-long process of figuring things out; it actually tends to move real fast.
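(If you want to check the day counts above yourself, a quick sketch; the hard-coded "today" is just the date implied by the comment's 820-day figure:)

    # Verify the day counts quoted above.
    from datetime import date

    chatgpt = date(2022, 11, 30)   # ChatGPT public release
    apple_ii = date(1977, 6, 10)   # Apple II first sold
    visicalc = date(1979, 10, 17)  # VisiCalc first sold

    print((date(2025, 2, 27) - chatgpt).days)  # 820 ("today" in the comment)
    print((visicalc - apple_ii).days)          # 859
    print(chatgpt + (visicalc - apple_ii))     # 2025-04-07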
what crazy progress? how much do you spend on tokens every month to witness the crazy progress that I'm not seeing? I feel like I'm taking crazy pills. The progress is linear at best
Large parts of my coding are now done by Claude/Cursor. I give it high-level tasks and it just does it. It is honestly incredible, and if I had seen this 2 years ago I wouldn't have believed it.
That started long before ChatGPT though, so you need to set an earlier date then. ChatGPT came about 3 years after GPT-3, the coding assistants came much earlier than ChatGPT.
Web app with a VueJS, Typescript frontend and a Rust backend, some Postgres functions and some reasonably complicated algorithms for parsing git history.
Is that because anyone is finding real use for it, or is it that more and more people and companies are using it, which is speeding up the rat race, and if "I" don't use it, then I can't keep up with the rat race?
Many companies are implementing it because it's trendy and cool and helps their valuation
I use LLMs all the time. At a bare minimum they vastly outperform standard web search. Claude is awesome at helping me think through complex text and research problems. Not even serious errors on references to major work in medical research. I still check, but the FDR is reasonably low, under 0.2.
I generally agree with the idea of building things, iterating, and experimenting before knowing their full potential, but I do see why there's negative sentiment around this:
1. The first microcomputer predates VisiCalc, yes, but it doesn't predate the realization of what it could be useful for. The Micral was released in 1973. Douglas Engelbart gave "The Mother of All Demos" in 1968 [2]. It included things that wouldn't be commonplace for decades, like a collaborative real-time editor or video-conferencing.
I wasn't yet born back then, but reading about the timeline of things, it sounds like the industry had a much more concrete and concise idea of what this technology would bring to everyone.
"We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings." doesn't inspire that sentiment for something that's already being marketed as "the beginning of a new era" and valued so exorbitantly.
2. I think as AI becomes more generally available and "good enough", people (understandably) will be more skeptical of closed-source improvements that stem from spending big. Commoditizing AI is more clearly "useful", in the same way commoditizing computing was more clearly useful than just pushing numbers up.
Again, I wasn't yet born back then, but I can imagine the announcement of Apple Macintosh with its 6MHz CPU and 128KB RAM was more exciting and had a bigger impact than the announcement of the Cray-2 with its 1.9GHz and +1GB memory.
The Internet had plenty of very productive use cases before social networking, even from its most nascent origins. Spending billions building something on the assumption that someone else will figure out what it's good for, is not good business.
And LLM's already have tons of productive uses. The biggest ones are probably still waiting, though.
But this is about one particular price/performance ratio.
You need to build things before you can see how the market responds. You say it's "not good business" but that's entirely wrong. It's excellent business. It's the only way to go about it, in fact.
Finding product-market fit is a process. Companies aren't omniscient.
You go into this process with a perspective; you do not build a solution and then start looking for the problem. Otherwise, you cannot estimate your TAM with any reasonable degree of accuracy, and thus cannot know how much return to reasonably expect on your investment. In the case of AI, which has had the benefit of a lot of hype until now, these expectations have been very much overblown, and this is being used to justify massive investments in infrastructure that the market is not actually demanding at such scale.
Of course, this benefits the likes of Sam Altman, Satya Nadella et al, but has not produced the value promised, and does not appear poised to.
And here you have one of the supposed bleeding edge companies in this space, who very recently was shown up by a much smaller and less capitalized rival, asking their own customers to tell them what their product is good for.
I disagree strongly with that. Right now they are fun toys to play with, but not useful tools, because they are not reliable. If and when that gets fixed, maybe they will have productive uses. But for right now, not so much.
Who do you speak for? Other people have gotten value from them. Maybe you meant to say “in my experience” or something like that. To me, your comment reads as you making a definitive judgment on their usefulness for everyone.
I use it most days when coding. Not all the time, but I’ve gotten a lot of value out of them.
They are pretty useful tools. Do yourself a favor and get a $100 free trial for Claude, hook it up to Aider, and give it a shot.
It makes mistakes, it gets things wrong, and it still saves a bunch of time. A 10 minute refactoring turns into 30 seconds of making a request, 15 seconds of waiting, and a minute of reviewing and fixing up the output. It can give you decent insights into potential problems and error messages. The more precise your instructions, the better they perform.
Being unreliable isn't being useless. It's like a very fast, very cheap intern. If you are good at code review and know exactly what change you want to make ahead of time, that can save you a ton of time without needing to be perfect.
OP should really save their money. Cursor has a pretty generous free trial and is far from the holy grail.
I recently (in the last month) gave it a shot. I would say once in the maybe 30 or 40 times I used it did it save me any time. The one time it did I had each line filled in with pseudo code describing exactly what it should do… I just didn’t want to look up the APIs
I am glad it is saving you time but it’s far from a given. For some people and some projects, intern level work is unacceptable. For some people, managing is a waste of time.
You’re basically introducing the mythical man month on steroids as soon as you start using these
> I am glad it is saving you time but it’s far from a given.
This is no less true of statements made to the contrary. Yet they are stated strongly as if they are fact and apply to anyone beyond the user making them.
Ah, to clarify, I was not saying one shouldn't try it at all — I was saying the free trial is plenty enough to see if it would be worth it to you.
I read the original comment as “pay $100 and just go for it!”, which didn’t seem like the right way to do it. Other comments seem to indicate there is $100 worth of credits claimable, perhaps.
One can evaluate LLMs sufficiently with the free trials that abound :) and indeed one may find them worth it. I don’t disparage anyone who signs up for the plans.
Can't speak for the parent commentator ofc, but I suspect he meant "broadly useful"
Programmers and the like are a large portion of LLM users and boosters; very few will deny usefulness in that/those domains at this point.
Ironically enough, I'll bet the broadest exposure to LLMs the masses have is something like Microsoft shoehorning copilot-branded stuff into otherwise usable products, and users clicking around it or groaning when they're accosted by a pop-up for it.
That's when you learn Vim, Emacs, and/or grep, because I'm assuming that's mostly variable renaming and a few function signature changes. I can't see anything more complicated, that I'd trust an LLM with.
I'm a Helix user, and used Vim for over 10 years beforehand. I'm no stranger to macros, multiple cursors, codebase-wide sed, etc. I still use those when possible, because they're easier, cheaper, and faster. Some refactors are simply faster and easier with an LLM, though, because the LSP doesn't have a function for it, and it's a pattern that the LLM can handle but doesn't exactly match in each invocation.
And you shouldn't ever trust the LLM. You have to review all its changes each time.
I misremembered, because I was checking out all the various trials available. I think I was thinking of Google Cloud's $300 in credits, since I'm using Claude through their VertexAI.
It’s not that the LLM is doing something productive, it’s that you were doing things that were unproductive in the first place, and it’s sad that we live in a society where such things are considered productive (because of course they create monetary value).
As an aside, I sincerely hope our “human” conversations don’t devolve into agents talking to each other. It’s just an insult to humanity.
I use LLMs everyday to proofread and edit my emails. They’re incredible at it, as good as anyone I’ve ever met. Tasks that involve language and not facts tend to be done well by LLMs.
The first profitable AI product I ever heard about (2 years ago) was an exec using a product to draft emails for them, for exactly the reasons you mention.
It's incredibly good and lucrative business. You are confusing scientifically sound, well-planned, conservative risk tolerance with good business.
Fair enough. I took the phrasing to mean social networking as it exists today in the form of prominent, commercial social media. That may not have been the intent.
> First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
LLMs will be gone in 10 years, at least in the form we know, with direct access. Everything moves so fast that there is no reason to think something better isn't coming.
BTW, what we've learned so far about LLMs will be outdated as well. Just me thinking. Like with 'thinking' models, the previous generation can be used to create the dataset for the next one. It could be that we can find a way to convert a trained LLM into something more efficient and flexible. Some sort of a graph, probably. Which can be embedded into a mobile robot's brain. Another way is 'just' to upgrade the hardware. But that is slow and has its limits.
You're assuming that point is somewhere above the current hype peak. I'm guessing it won't be, it will be quite a bit below the current expectations of "solving global warming", "curing cancer" and "making work obsolete".
> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
That's not a scare quote. It's just a proposed subtext of the quote. Sarcastic, sure, but not a scare quote, which is a specific kind of thing. (From your linked Wikipedia: "... around a word or phrase to signal that they are using it in an ironic, referential, or otherwise non-standard sense.")
Right. I don't agree with the quote, but it's more like a subtext thing and it seemed to me to be pretty clear from context.
Though, as someone who had a flagged comment a couple years ago for a supposed "misquote" I did in a similar form and style, I think HN's comprehension of this form of communication is not super strong. Also, the style more often than not tends towards low-quality smarm and should probably be resorted to sparingly.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Not a confident place for an org trying to sustain a $XXXB valuation.