Honestly I'm actually kind of impressed that ChatGPT (and its competitors) are as accurate as they are, considering that they are still fundamentally fancy autocomplete [1]. The fact that when I ask ChatGPT a question it's generally "accurate enough" is really quite impressive.
I'm not claiming it's perfect, and like Tim Cook I'm not 100% sure it's actually possible to get to a 0% error rate, particularly considering it's trained on data written by humans on the internet, and humans write a lot of really dumb shit on the internet all the time, and human curation can't possibly stop all of it. It's easy to tell the bot not to scrape obviously sketchy sites like Infowars or something, and maybe block some specific subreddits, but there will still always be very dumb people posting very dumb opinions [2] in the "mainstream" areas as well.
It's not weird to see really stupid stuff on actual news websites like CNN talking about "ghost sightings", and I'm not sure how you correct for that in your training. You could block CNN, but that's a pretty big repository of news you'd be losing as training data, and moreover how do you actually block a news website impartially?
[1] Not to undermine the cool stuff that's being done in the space, just my rough understanding of how the algo actually works.
But the mistakes made will be different, and historically you’d have a source to consider. Here there is just one global source telling you to add glue to your pizza to make the cheese stick.
Nope, that’s not how it works. Those references aren’t generated in such systems, they are retrieved. They might not provide references to all the sources, of course, same as humans.
Exactly. Right now if I google something (AI overview aside), I'm linked to a source. That source may or may not include its sources, but its provenance tells me a lot. If I'm reading info linked off Mayo Clinic, their reputation leads me to judge the information as high quality. If they start putting out a bunch of garbage, their reputation gets shot and I'll look elsewhere. With LLMs there is no such choice, and they will spew everything from high- to low-quality (to dangerously wrong) info.
An LLM itself cannot provide references with any integrity. They're autoregressive probabilistic models. They'll happily make something up, and you can even try to train them to emit references, but as the article states this is very, very far from a guarantee. What you can do is a kind of RAG setup where you have some existing database whose results you include in the prompt to ground it, but that's not inherent to the model itself.
All outputs of an LLM are generated by the LLM. LLMs today can and do use external data sources. Applied to humans, what you're saying is like saying that it's not humans who provide references because they copy the BibTeX for them from arXiv.
But if you’re using an external data source and putting it into the context then it’s the external data source that’s providing the reference — the LLM is just asked to regurgitate it. The large language model, pretrained on trillions of tokens of text, is unable to provide those references.
If I take llama3, for example, and ask it to provide a reference, it will just make something up. Sometimes these things happen to exist; oftentimes they don't. And that's the fundamental problem: they hallucinate. This is well understood.
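To make the RAG-style grounding mentioned a couple of comments up concrete, here is a minimal Python sketch. Everything in it (the toy corpus, search_documents, llm_complete) is a hypothetical stand-in rather than a real API; the point is just that the citations come from the retrieval step, while the model is only asked to restate them.

    # Toy sketch of RAG-style grounding: the citations come from retrieval,
    # not from the model's weights. The corpus and llm_complete() below are
    # stand-ins for a real vector index and a real model API.

    TOY_CORPUS = [
        ("https://example.org/pizza-cheese", "Cheese adheres better when the sauce is reduced."),
        ("https://example.org/glue-safety", "Glue is not food-safe and should never be eaten."),
    ]

    def search_documents(question: str, top_k: int = 2):
        # A real system would embed the question and do a nearest-neighbour
        # lookup; here we just return the toy corpus.
        return TOY_CORPUS[:top_k]

    def llm_complete(prompt: str) -> str:
        # Placeholder for an actual model call (e.g. an HTTP request to an API).
        return "<answer grounded in the numbered sources above>"

    def answer_with_references(question: str) -> str:
        passages = search_documents(question)
        context = "\n\n".join(
            f"[{i + 1}] {url}\n{text}" for i, (url, text) in enumerate(passages)
        )
        prompt = (
            "Answer using ONLY the numbered sources below, citing them as [1], [2].\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )
        # The model is asked to restate the retrieved citations, not invent them.
        return llm_complete(prompt)

    print(answer_with_references("Does glue help cheese stick to pizza?"))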
It’s encouraging that we can get domains like aviation to 5 or 6 nines of reliability with human-scale systematization. To me, that implies we can get there with AI systems, at least in certain domains.
Could you elaborate on why that implication might be true?
I can think of a counter-point:
Humans can deal with 'error cases' pretty readily. The effort for humans to get those last few 9's might be roughly linear: for example, one extra checklist might be the thing that was needed, or simply adding a whole extra human to avoid a problem. OTOH, for computers to get that last 0.001% correct might take orders of magnitude more effort. The effort of humans vs. computers does not scale at the same rate. Why, therefore, should we think that something humans can do with 2x effort would not require a 2000x better AI?
There are certainly cases where the inverse would be true, where human effort to get better reliability would be more than what is needed for a computer (monitoring and control systems are good examples, e.g. radar operators, nuclear power plants). Though in cases where computer effort scales better than human effort, it's very likely those tasks have been automated already.
That high level of reliability in aviation is likely due in part to automation of the tasks that computers are good at.
Even if it does, that puts AI ahead in what, 22 years with 2x improvements every 2 years? The simple problem with us humans is that we haven't notably improved in intelligence for the last 100,000 years, and we'll be beaten eventually. It's not even a question barring some end-of-the-world event; we already know it's completely possible because we are living proof that 20 W can be at least this smart.
And it's really just the upfront R&D that's expensive; silicon is cheap. Humans are ultra-expensive all around, constantly, so the longer you use the result, the more that seemingly ludicrous initial cost amortizes toward zero.
If I can paraphrase, you believe that humans are "intelligence level" 100, and AI is essentially somewhere at like 10 or 20, and is doubling every 2 years.
First, is AI actually improving 2x every 2 years? Can you link to some studies that show the benchmarks? AFAIK OpenAI with ChatGPT was something of an 8 or 10 year project. It being released seemingly "out of nowhere" really biases the perception that things are happening faster than they actually are.
Second, are human intelligence and AI even in the same ___domain? How many dimensions of intelligence are there? Out of those dimensions, which ones have AI at a zero so far? Which of those dimensions are even entirely impossible for AI? If the question is whether AI can be just as smart as humans in every way, then given that we still don't understand that much about human intelligence and cognition, let alone that of other animals, I'm skeptical the answer is yes. (Consider science's view of animal intelligence: for a long time the thought was that animals are biological automatons, which I think shows we don't even understand intelligence, let alone how to build AGI.)
Next, even if the rise in intelligence is along a single dimension and it really is the same sport and playing field, what's to say the exponential growth you describe will be consistent? Could it not be the case that the 1000x-to-1001x improvement is just as hard as the entire first 1000x? What's to say the increase in complexity is not also exponential, or even combinatorial?
Because humans have a 0% error rate? Whether or not the error function can be reduced to zero, I anticipate that human meddling will soon present more of a wrench in the gears of machine cognition than a source of error correction. We see this already with the lobotomization of GPT models over "safety"/copyright concerns.
A key difference is that humans can validate and perform orthogonal checks. We can prove things. An LLM, which is essentially just an NLP model, is picking a probability for "what should follow this word, given a question that 'looks' like this." Once the answer is chosen, AI so far is left with no other options. If someone says the choice is wrong, what can the AI do? Choose a less likely option?
For example, humans can prove that 5 times 8 is 40 in a variety of ways. While you might be wrong in arithmetic, you can check your answer. AI can't check its answer, it does not know when it is wrong (it picked an answer it 'thought' was right, ergo it has no ability to consider that as a wrong answer, otherwise it would have chosen a different answer).
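As a toy illustration of what those orthogonal checks look like (my example, not the commenter's), here are three independent ways to confirm 5 × 8 = 40; an LLM's single forward pass has no analogous step:

    # Three independent checks of the claim 5 * 8 == 40; if any disagreed,
    # we'd know the original answer was wrong.
    answer = 5 * 8

    assert answer == sum([8] * 5)                  # repeated addition: 8+8+8+8+8
    assert answer // 8 == 5 and answer % 8 == 0    # inverse operation: division
    assert answer == 5 * 10 - 5 * 2                # decomposition: 5 * (10 - 2) = 50 - 10
    print("all checks agree:", answer)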
A pretty accurate statement - and this, I think, is the realization most likely to kill a lot of consumer AI features. There's a large liability factor that I don't think we'll ever get over.
Of course not. Humans "hallucinate" all the time too.
Why do we hold AI to a higher standard than humans? It's the same with self driving cars. "Oh it had one accident, must not be safe!" and yet humans have 100s of accidents a day.
The main problem here is expectations -- everyone expects the machine to be perfect, and when it's not, it breaks expectations. In the past, machines were generally a lot more accurate; they just didn't do a lot of stuff. Now that they can do a lot more stuff, their accuracy is coming down to human levels, and it throws everyone off.
I don't think we need to fix hallucinations, we need to fix expectations.
Humans don't have 100% perfect recall of every fact they've ever learned, why do we expect AI to?
Because we're building something that runs autonomously at scale. Similar reason why you don't give one human complete power over the entire economy without mountains of trials/elections etc.
Expecting more from AI is being cautious about wanton stereotype machines that will remove plenty of "redundant" humans from the workforce without demonstrating that they are capable of redundancy against error.
If we're digesting and calcifying human knowledge, it's likely we have only one shot to get it right, so we had better do it correctly.
Ask yourself if you would have been fine with calculators if they did math at an equal error rate to humans.
Properly calibrating expectations is part of it, but surely we could do better by having the AI cite proper sources and show us its line of reasoning? The problem is that right now you ask an LLM a question and it'll confidently give you an incorrect answer without any sources or explanation of the reasoning behind it.
Having tools that shift blame and confidently lie to you is a problem. Here's a recent example that bothered me: I tried using Meta AI to generate an image and it failed. When I asked why it failed, it kept lying and making up excuses for how it was my fault or my browser was doing something wrong. In reality, the likely explanation was that the image prompt included something that got flagged as sensitive content, which is why the generation failed. (It wasn't even anything nefarious, I just tried several variations of catgirls.) If something goes against some content guideline I want the tool to tell me that, rather than gaslight me with lies.
In most human interactions, do people cite their sources? If someone doesn't want to tell you something because it makes them uncomfortable, do they tell you that, or do they lie to you to make you feel better?
I'm with you, it would be nice if LLMs could do those things, but also I don't expect my human friends to do that, so I'm not sure why I should expect it of my AI.
You must be kidding. Every day my spouse and I have a conversation that involves one of us realizing that we've mis-remembered something. Not getting every fact correct is super-common in humans.
This right here is a perfect example of the problem of expectations. Do you have 100% accurate recall of every fact you've ever learned?
Everyone hallucinates stuff; you don't need to see the Virgin Mary to hallucinate. You hallucinate stuff every day and you're not even aware of it. Ever had a "déjà vu"? Ever mistaken a towel in a dark room for your black cat? Every day.
Persistent and frequent/high-percentage hallucinations? Yes, that's cause for concern/medical consultation.
But we don't over-react when humans are fooled by optical illusions/magic tricks, or when they conflate two things, or when they confuse correlation and causation, or when they selectively prioritize some information over other information, or...
> Cook responded, “We’re integrating with other people as well.” […] Apple could eventually bring Google Gemini to iOS, too.
It might be more expensive, but perhaps Apple is uniquely positioned to synthesize/compare answers from multiple models to provide more accurate results.
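One hedged sketch of what such synthesis could look like in principle, with ask_model_a / ask_model_b as hypothetical stand-ins for calls to different providers (this is speculation, not Apple's actual design): only surface an answer when the models roughly agree, otherwise flag uncertainty.

    # Hypothetical cross-checking of two models; the ask_* functions are
    # placeholders, not real APIs.
    from difflib import SequenceMatcher

    def ask_model_a(question: str) -> str:
        return "placeholder answer from model A"

    def ask_model_b(question: str) -> str:
        return "placeholder answer from model B"

    def cross_checked_answer(question: str, threshold: float = 0.8):
        a, b = ask_model_a(question), ask_model_b(question)
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            return a          # the models roughly agree; surface the answer
        return None           # disagreement: defer, or show both with a warning

    print(cross_checked_answer("When was the first iPhone released?"))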
Tim Cook spent decades protecting both IP and profit by pitting vendors against each other. He's avoided being captive to China, and Apple will not let AI capture its business.
> Cook said he would “never claim” that its new Apple Intelligence system won’t generate false or misleading information with 100 percent confidence.
Well yeah, no shit, that's the technology they're using. Imagine making the opposite claim: that their system will never suffer from the inherent limitations of the underlying technology. The article even has a short paragraph noting that literally nobody else can make any stronger claim than this, yet it's presented as a headline-worthy statement.
Hallucinations reduce the success rate of AI workflows, which must be taken seriously.
Imagine a workflow with 8 steps where each step/agent has a 95% success rate: the success rate of the whole workflow is only (1 − 0.05)^8 ≈ 0.66, i.e. about 66%. Not bad, but not enough to replace humans yet (unless 66% makes you profitable).
The hallucinations/errors compound and can misguide decisions if you rely too much on AI.
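The compounding arithmetic from the comment above, generalized so you can plug in your own per-step reliability:

    # Success rate of a pipeline in which every step must succeed independently.
    def workflow_success_rate(step_success: float, n_steps: int) -> float:
        return step_success ** n_steps

    print(workflow_success_rate(0.95, 8))    # ~0.663 -> the ~66% figure above
    print(workflow_success_rate(0.99, 8))    # ~0.923 -> even 99% steps erode quickly
    print(workflow_success_rate(0.999, 8))   # ~0.992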
Not enough to replace humans in most critical tasks, but enough to replace Google, that's for sure. My own success rate at finding information on Google these days is around 50% per query at best.
I prefer "confabulation," which describes the analogous human behavior where you have no idea what the objective truth actually is, so you just make up something that sounds right.
Their developers have intent. That intent is to give the perception of understanding/facts/logic without designing representations of such a thing, and with full knowledge that as a result, it will be routinely wrong in ways that would convey malicious intent if a human did it. I would say they are trained to deceive because if being correct was important, the developers would have taken an entirely different approach.
Generating information without regard to the truth is bullshitting, not necessarily malicious intent.
For example, this is bullshit because it's words with no real thought behind it:
“if being correct was important, the developers would have taken an entirely different approach”
If you are asking a professional high-stakes questions about their expertise in a work context and they are just bullshitting you, it's fair to impugn their motives. Similarly if someone is using their considerable talent to place bullshit artists in positions of liability-free high-stakes decisions.
Your second comment is more flippant than mine, as even AI boosters like Chollet and LeCun have come around to LLMs being tangential to delivering on their dreams, and that's before engaging with formal methods, V&V, and other approaches used in systems that actually value reliability.
Hallucinating has the implication of being wrong. The word further adds the context of being elaborately wrong. That feels pretty accurate to describe an AI going into detail when it is wrong.
I know plenty of actual humans who speak with apparent authority and experience on all variety of things they know nothing about. Current AI has copied an existing human behavior if anything.
Of course not. Hallucinations are the only thing these models can produce; they have no mechanism to tell whether what they're writing is a fact or not. They generate bullshit in the sense of Frankfurt's "On Bullshit".
> It is intractable for you to confirm every fact you believe
A 'fact' that is not confirmed is an opinion. An 'opinion' sincerely held to be true without data is called faith.
I do think humans are quite bad at recognizing that most of what they think they know is opinion, and/or that the 'facts' are often based on personal experience (which is by definition a cherry-picked data set) and are thus also opinion.
I also think that those who practice science extensively are better practiced at identifying what is fact vs. unsupported opinion. Without that practice, it's a lot easier IMO to think a person knows a lot more facts than they really do. Which is to say, humans generally know very little, and it's not terribly comfortable to acknowledge and feel that way.
Which leads us to the ultimate reality: most of what a lay person has to say is opinion, and it's not really worth much - and it's even worse when data no longer matters.
Computers are nothing more than a set of persisted electrical 0/1 signals and a series of logic gates with intricate timing. There are no 'facts' in that world.
For example, "Chat GPT" does not understand questions; it does some pattern matching to find responses that probabilistically would follow. Another example: AI does not understand it is drawing "a hand", it's just a probability algorithm indicating that these pixels are likely to be "good" following these other pixels.
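A cartoon of that "probabilities over what follows" behavior, with made-up numbers (this is not a real model's distribution, just the shape of the mechanism): the model only ever samples a next token from a weighted distribution, and there is no separate step where a fact gets checked.

    import random

    # Made-up next-token probabilities after the prompt "The capital of Australia is".
    next_token_probs = {
        "Canberra": 0.55,
        "Sydney": 0.30,      # plausible-looking but wrong continuation
        "Melbourne": 0.10,
        "Auckland": 0.05,
    }

    tokens, weights = zip(*next_token_probs.items())
    choice = random.choices(tokens, weights=weights, k=1)[0]
    print(choice)  # roughly 45% of the time this confidently outputs a wrong city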
If someone can build a 'reasoning' step into AI, that would likely be a game changer: an AI that generates an answer, then compares that answer against an actual fact database, and then modifies the answer in light of the known facts. Even better, one that can also challenge when an accepted fact is likely to be wrong. To do all that, the AI would need to understand abstract concepts. As a baseline of where we are on that: my cat is able to do that, and AI currently has zero capability to do it.
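For concreteness, here is one way that wished-for "generate, check against facts, revise" loop might be wired up. Everything here (FACT_DB, generate_answer, extract_claims) is a hypothetical stand-in; nothing like this is claimed to exist in current products.

    # Hypothetical generate-then-verify loop: draft an answer, check each claim
    # against a fact store, and revise (or give up) if a claim is contradicted.

    FACT_DB = {"boiling point of water at sea level": "100 C"}

    def generate_answer(question: str, feedback: str = "") -> str:
        return "Water boils at 100 C at sea level."     # placeholder model output

    def extract_claims(answer: str) -> list[str]:
        return ["boiling point of water at sea level"]  # placeholder claim extraction

    def verified_answer(question: str, max_rounds: int = 3) -> str:
        feedback = ""
        for _ in range(max_rounds):
            answer = generate_answer(question, feedback)
            contradicted = [c for c in extract_claims(answer)
                            if c in FACT_DB and FACT_DB[c] not in answer]
            if not contradicted:
                return answer                           # all checked claims hold up
            feedback = f"These claims conflict with the fact store: {contradicted}"
        return "I'm not confident in an answer."        # give up instead of guessing

    print(verified_answer("At what temperature does water boil at sea level?"))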
I'm not sure what it means, but a lot of people on this forum have a strong tendency to denigrate human beings whenever people bring up that LLMs kind of suck. I think this undersells what humans are capable of and distracts from the discussion about what LLMs can do. I'm really tired of this discussion immediately getting sidetracked into some philosophical quagmire about the nature of thinking and the mind when someone is trying to bring up a point about LLMs themselves.

When this Apple Intelligence feature comes out, I won't have a tiny human in my phone telling me things, I'll have a ChatGPT API call. LLMs are being integrated into these systems, not human beings. When someone talks about other systems, like search engines, nobody talks about how they compare to humans. It's weird that we're dead set on making the comparison with these systems in particular. Maybe that's down to poorly chosen names for things like artificial intelligence and neural networks inviting the comparison.
> It's weird that we're dead set on making the comparison with these systems in particular. Maybe that's down to poorly chosen names for things like artificial intelligence and neural networks inviting the comparison.
It could be that those names are not poorly chosen. People want grant money, companies want venture capital money. What better way to do that than to use specifically chosen names (also known as propaganda!)
Next, there is certainly "main character syndrome": the hope of AGI, that these things are on that path, and that we are the ones to unlock it. Someone with that point of view would have a lot of incentive to make those comparisons.
No, humans legit just suck at associative memory for textual data. This happens to be one of the few things LLMs excel at. LLMs suck at sample efficiency, they do not model most types of cognition, they suck at multimodality (so far), in-context learning, etc, etc. But they absolutely blow humans out of the water on associative memory and on taking advantage of insanely large amounts of it instantaneously.
Humans will not be outdone on other things in my or my children’s lifetime.
What use is associative memory for textual data without the ability to also validate an answer? Honest question.
The answer I come to is "fancy autocomplete." If you know there is a significant (non-zero) chance of a wrong answer, you have to validate every answer. To me, that describes autocomplete: useful, but not something you can use blindly (which greatly limits the utility, i.e. a human is still required).
Some useless fun about the halting problem is that it only works in a mathematical sense, because all real-world programs halt if given enough time. The heat death of the universe comes for us all in the end.
Is anyone expecting anything more at this point? Maybe it’ll improve over time. But expecting Apple’s implementation of ChatGPT to be better than ChatGPT is unrealistic
I'm impressed with ChatGPT et al right until I query something I know a lot about, and it then proceeds to "hallucinate" far worse than any journalist or newspaper.
It makes me question if every other piece of information it's given me is factual.
It is the biggest "elephant in the room" and if/until it's corrected, every single one of these companies is on a trajectory towards worthlessness. It is a much bigger problem than "People make mistakes too."