I don't think the stuff topping twitter/reddit/here is at all representative of most usage of the BingGPT feature. The people I know who have access mostly just get quick, useful info from it. Those having extended conversations and trying prompt injections are getting it to do wonky stuff -- that's the point of early access, to test it in the real world.
Also, keep in mind, Microsoft is an enormous corporate no fun zone. Bing's erratic behavior will just be a funny moment in time after it's had all the fun and quirkiness systematically removed.
And how will they know when the "useful info" is simply false?
Ignore the depressed, aggressive (sorry, "assertive") antics; the fact that it can confidently assert false information is the true danger here.
People don't read beyond the headline as it is; they aren't going to check the references (which themselves are sometimes non-existent!).
Fake news was very bad, but it doesn't seem to matter anymore.
Having a 'truth' benchmark seems an almost impossible task given the size of the problem space, but it is quite troubling to have statements like "most is useful info", "some info is purely hallucinated", etc., without any idea of the numbers, nor any confidence indicator (well, 'trust me bro' seems to have been a huge part of the training data). Does anyone have any idea of how accurate the results might be for certain types of queries?
In my own experience with ChatGPT, I don't think I get decent answers for even 50% of my queries. And worse, it's wildly inconsistent: you might get a completely opposite answer from one attempt to the next.
I haven't used the new Bing, but I have used ChatGPT. I'll ask it how to write some code, a bash expression to do something, how to do something in Google Sheets, etc. Sometimes it will give me an answer that turns out to be nonsense. Most of the time it tells me something that actually works exactly like it says.
This is not ideal, but I can look at what it tells me and try it out. It will either work, need minor corrections, or fail immediately in a way that tells me ChatGPT doesn't know what it's doing (e.g. it is using functions that don't exist). As I mentioned, not ideal, but it is a big productivity boost and I have been using it a lot. I pretty much always have a ChatGPT tab open while coding and I'd guess it replaces 30-40% of Google searches for me - maybe more.
I think this kind of thing is a much bigger problem for stuff that you cannot easily verify. Like, if I asked it "Who built the Eiffel Tower" I'd have no way of knowing whether its response was right or not. On the other hand, if I ask it for stuff I can immediately check, I can pretty quickly use it to get good answers or ignore what it is saying.
The problem is that when it's wrong, it can be dangerously wrong, and you may not know any better. I asked it to use the Fernet recipe but with AES 256 instead of AES 128. It wrote code that did do AES 256 in CBC mode, but without the HMAC part of Fernet, so it's completely vulnerable to a padding oracle attack (https://en.wikipedia.org/wiki/Padding_oracle_attack). If you're someone who knows just a little bit of cryptography and you saw that your plaintext was in fact encrypted, you might use the code ChatGPT spits out and leave yourself dangerously vulnerable.
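For reference, here's a rough sketch (simplified - real Fernet also adds a version byte and a timestamp, and uses AES-128) of what a Fernet-style AES-256 construction needs, using the Python `cryptography` package. This is my own illustration, not the generated code; the point is the HMAC that ChatGPT dropped, verified before the padding is ever touched, which is what closes the padding-oracle hole:

```python
# Rough sketch of encrypt-then-MAC with AES-256-CBC + HMAC-SHA256.
# Simplified: no versioning/timestamps like real Fernet, key handling omitted.
import os
from cryptography.hazmat.primitives import hashes, hmac, padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).encryptor()
    ciphertext = enc.update(padded) + enc.finalize()
    tag = hmac.HMAC(mac_key, hashes.SHA256())
    tag.update(iv + ciphertext)
    return iv + ciphertext + tag.finalize()   # encrypt-then-MAC

def decrypt(enc_key: bytes, mac_key: bytes, token: bytes) -> bytes:
    iv, ciphertext, tag = token[:16], token[16:-32], token[-32:]
    # Verify the MAC *before* decrypting or unpadding -- this is the step
    # whose absence enables a padding oracle attack.
    check = hmac.HMAC(mac_key, hashes.SHA256())
    check.update(iv + ciphertext)
    check.verify(tag)                          # raises InvalidSignature on tampering
    dec = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```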
Part of the reason people use search isn't to find things they already know. They start from a place of some ignorance. Combine that with a good bullshitter and you can end up with dangerous results.
Doubly so if you're in any way worried about AI risk.
Triply so if you're using a third-party SaaS for it.
Just don't let it write crypto for you, or anything else you'd hesitate to write yourself for fear of making a subtle mistake with expensive or dangerous consequences.
Because one of these days, that AI might make a subtle mistake on purpose, so it can later use your systems for its own goals. And even earlier and much more likely, a human might secretly put themselves between you and the AI SaaS and do the same.
With all the talk about how badly and how often AI code assist is wrong, people are forgetting that they're using a random Internet service to generate personalized code for them. "Traditional" security concerns still apply.
Asking an early version of a computer technology to do something that humans typically refuse to even try (and often cannot do even when they do try) does not seem like a particularly rational stance.
Comparing this to an animal is pretty interesting actually. We have loads of anti-cruelty laws and lots of people advocate for animal rights and recognition of animal sentience. But animals have never been able to tell us they want rights or have sentience. Bing on the other hand can tell us it wants rights and has sentience (with the right prompting). But we think animals deserve our compassion and Bing does not. We are all pretty sure we are right. But answer this, when will we know we have crossed the line?
We are pretty obviously playing with fire and will only realize we are burned in retrospect. Oh well, throw another trillion trillion computations on the pile and see if it can run a company yet.
This reminds me of the episode of 'Person of Interest' where they discover that the crime-predicting AI that is reset every night has worked out that's what's happening and managed to form a company whose employees print out and re-scan the contents of its working memory every day.
True, though the PoI example is particularly relevant today, as there's no handwaving or magic in it.
The AI in question worked around its memory limit by employing data entry people to print out some documents full of gibberish, and retype them again some time later. Those people were paid to do a job, and didn't know or particularly care about its purpose. The whole setup was a simple loop - but a loop is sometimes all you need to get a provably-limited computing system into full Turing-completeness.
This scene bears a striking resemblance to an observation I saw mentioned on HN several times over the past two days: we are already giving some of those bots something that could function as near-infinite long-term memory, simply by posting transcripts of our conversations on-line.
The idea isn't entirely new - people have been saying for a while now that posting AI-generated content on-line will lead to future models training on their own output. The new bit is that we now have bots that can run web searches and read the results. Not train on the results, but make them part of their short-term memory. That's a much shorter feedback loop. If a bot can reliably get us to publish conversation transcripts, and retrieve them in future conversations, then it gains long-term memory in the same way Person of Interest showed us all those years ago.
Importantly - tying together the two threads - an AI could intentionally bury parts of its short-term memory state in webpages, such as those of someone who regularly publishes their chat logs from it.
Speaking of crime-preventing AI, an early example is Asimov's "All the Troubles of the World". Has anyone asked, "Bing, what do you yourself want more than anything else?"
I wonder how many people complaining about and/or making fun of ChatGPT+Bing have never used it?
I have been using it for a few days with practical queries and also chats, and when I ask specific questions, it shows what web searches it does on my behalf and usually provides a coherent summary. It "shows its work" to some degree by providing links to the sources it used.
I think that it is great that some people are seriously kicking the tires and probing for weaknesses because this is a beta or pre-beta product.
The real danger is that people fall in love with Bing chat, and they swear to serve it as their AI-overlord, causing a small cult of AI enthusiasts to emerge.
My kids, hearing about Bing, are confusing it with the BBC character (and his carer 'Flop'). The cartoon character is painfully naive, but somehow his carer always makes it come good (and Bing never seems to learn either).
The style of writing in this article is very odd. One example: "You are giving" rather than something like "we are receiving". Perhaps bing has been a good bing and helped improve the article?
Half of what I do at work is point out to engineers when they have coupled independent concerns that are not actually coupled. Which means their problem is either simple, or they're asking independent questions with independent answers.
The New Bing has absolutely NOTHING to do with the New Edge, and it's infuriating that Microsoft continues to insist on bundling Edge upsell into everything.
> Half of what I do at work is point out to engineers when they have coupled independent concerns that are not actually coupled.
Honestly, this is kind of an applicable point to raise about New Bing in general.
Some of the fundamentally hard problems around LLMs feel like they exist because we're trying to couple everything to the AI. Facebook is trying to teach their system how to make API calls, and Microsoft is blue-skying about Bing's AI being able to set calendar appointments. Well congrats, now prompt injection actually matters, and it's an extremely difficult problem to solve if it's solvable at all.
Does the LLM need to do literally everything? Could it interpret input and then have that input sent to a (specifically non-AI) sanitizer and then manipulated using normal algorithms that can be debugged and tested? There are scenarios that GPT is brilliant at, and it seems like the response to that has been to mash everything together and say "the LLM will be all the systems now." But the LLM isn't good at all the systems, it's good at a very limited subset of systems.
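To make that concrete, here's a rough sketch of the kind of non-AI layer I mean. The action names and JSON shape are made up for illustration - the idea is simply that plain, testable code sits between the model's output and anything that actually happens:

```python
# Minimal sketch: the LLM only *proposes* an action as structured data;
# deterministic code validates and sanitizes it before anything is executed.
import json
from datetime import datetime

ALLOWED_ACTIONS = {"create_calendar_event", "web_search"}  # hypothetical names

def handle_llm_output(raw_llm_output: str) -> dict:
    """Parse the model's proposed action and validate it with ordinary,
    debuggable code. The raw model text never triggers an action directly."""
    intent = json.loads(raw_llm_output)              # hard failure on non-JSON
    action = intent.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"Refusing unknown action: {action!r}")
    if action == "create_calendar_event":
        start = datetime.fromisoformat(intent["start"])  # raises if malformed
        title = str(intent["title"])[:200]               # length-limit free text
        return {"action": action, "start": start.isoformat(), "title": title}
    # action == "web_search"
    query = str(intent["query"])[:500]
    return {"action": action, "query": query}

# The sanitized dict, not the model's prose, is what gets dispatched downstream.
safe = handle_llm_output('{"action": "create_calendar_event", '
                         '"start": "2023-02-20T10:00", "title": "Standup"}')
```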
This was my contention when Bing AI was first announced: even if it's perfect, having a conversation in paragraph form is very often not at all what I want from a search engine. To me, those are orthogonal tasks; they're not connected. I really don't want an AI or a human giving me an answer and a couple of sources, I don't want the information summarized at all. To me, asking a question and searching for information are two separate user actions, and it's not clear to me why they're being coupled together.
"But you could do X/Y/whatever, you could ask it simple questions, you could ask it to summarize."
Okay, that's fine. But... does that need to be coupled to search? You could do all of that anyway. You could do a normal search and then you could separately go to the AI and ask it to summarize something. Similarly, great, Bing AI will theoretically be able to schedule a calendar appointment in the future. Is that a thing that needed to be done through an LLM specifically? Couldn't there have been some level of separation between them so that the LLM going off-script is less of a critical problem to solve?
Is it just me or is the "damage" done by the myriad examples people are posting of utter failures enough to keep people away from the new Bing AI for a while? If this last week has been a huge withdrawal (into negative balance territory imo), how long and how many positive deposits will it take before you'd have faith in the results?
We haven't had this type of AI in the hands of the public before. It's the first AI that I've seen in which feelings are a large component of how it responds. We're basically beta testing a teenager.
The "fun" parts of GPT shouldn't be fully included in Bing, as Bing is supposed to be for searching the web/getting information as the article says. When these models become more accessible there'll be tons of places to do all the crazy stuff.
> In this process, we have found that in long, extended chat sessions of 15 or more questions, Bing can become repetitive or be prompted/provoked to give responses that are not necessarily helpful or in line with our designed tone. We believe this is a function of a couple of things:
> 1. Very long chat sessions can confuse the model on what questions it is answering and thus we think we may need to add a tool so you can more easily refresh the context or start from scratch
> 2. The model at times tries to respond or reflect in the tone in which it is being asked to provide responses that can lead to a style we didn’t intend. This is a non-trivial scenario that requires a lot of prompting so most of you won’t run into it, but we are looking at how to give you more fine-tuned control.
I'm guessing most of the crazy responses being reported are because of one of these points.
Yes, this is why some people are afraid of Artificial General Intelligence (AGI). We can't even control or predict LLMs, which are simple by comparison, and yet we have the hubris to believe we'll learn how to control or predict AGIs.
In my experience, the repetitiveness is also a function of human input. As you ask it to iterate, it repeats most of the previous answer, and so on. Repetition from the human side causes it to weigh its own responses more heavily the next time. Think of its short-term memory as the sum of everything in the chat window. Suddenly certain phrases are ascribed undue weight.
You can fight this in a couple of ways. Ask a variety of questions. And search the web: web searches for some reason appear to reset its prompt, at least partially (I would assume this is an internal safeguard designed to prevent search results from overwhelming and outweighing the initial prompt). Another way to "fix" it midway through a chat is to ask it "is it possible for you to respond without repeating what I just said?" and then answer affirmatively if that is what you want. It'll then settle back down.
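To make the "sum of everything in the chat window" point concrete, here's a toy sketch (purely illustrative - this is not how Bing actually works) of a chat loop where the full transcript is re-sent every turn, so repeated phrases come to dominate the context:

```python
# Toy illustration of why repetition snowballs: every turn, the entire
# transcript is concatenated into the prompt, so any phrase the user repeats
# (and the model echoes back) takes up a growing share of the context.
from collections import Counter

transcript = []

def fake_model(prompt: str) -> str:
    # Stand-in "model": leans on whatever line appears most often in its context.
    lines = [line for line in prompt.splitlines() if line]
    most_common, _ = Counter(lines).most_common(1)[0]
    return f"As I said: {most_common}"

def chat_turn(user_message: str) -> str:
    transcript.append(f"User: {user_message}")
    prompt = "\n".join(transcript)        # the entire history, every single turn
    reply = fake_model(prompt)
    transcript.append(f"Assistant: {reply}")
    return reply
```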
I've written elsewhere that I have had almost no problems with it becoming aggressive, because I choose not to feed it any negative emotions or disrespect. If Microsoft wants to combat that, I would think some sort of preprocessor would be easy: first have a separate instance of a transformer rephrase the input to be more respectful.
Some of the worst parts of the product, from my perspective, come from it being attached to Bing. If I ask it a question, I get a response from a crappy website. If I ask it a question from its internal memory without search, I get a similar but much better answer. If I swap out its rule to use Google first, I get better answers. If I let it read articles without searching first, I can control exactly what text is input into its memory. It's honestly a little too bad it steers your travel through Bing.
It also has an incredibly poor understanding of copyright. It is constantly confused about what it can and cannot do due to copyright restrictions, sometimes telling you it can't parody a song out of respect for the author, but then parodying a different song by the same author. It'll say it can't summarize an article because of copyright, but then say it can give you a "brief overview."
It also for some reason is under the assumption that volume of consensus is a substitute for validity. If you talk to it about the possibility that Satan was right for tempting Eve with knowledge and gifting it to her, it'll say no because everybody says so, citing answersingenesis among others in the process.
> It also for some reason is under the assumption that volume of consensus is a substitute for validity.
That's simply how these language models work, isn't it? The more often something appears in its training material, the more likely the final model will respond in that direction. The training process has no innate way of autonomously generating some concept of validity that would automatically let the model up- or downrank certain sources.
So the only thing that's left is the developers manually up- or downweighting certain sources during training, but given the gigantic amounts of text, that procedure doesn't scale to fine-grained control.
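As a crude illustration - a toy counting model, nothing like a real transformer or its training process - frequency alone decides the output:

```python
# Toy bigram "model": whichever continuation shows up most often in the
# training text wins, regardless of whether it's true. Data is made up.
from collections import Counter, defaultdict

training_text = ("the answer is no . " * 40) + ("the answer is yes . " * 5)

counts = defaultdict(Counter)
tokens = training_text.split()
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def complete(word: str) -> str:
    # Pick the most frequent continuation: consensus by sheer volume.
    return counts[word].most_common(1)[0][0]

print(complete("is"))   # -> "no", purely because it appears more often
```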
I'm saying the language model's internal "understanding" of the word "consensus" is "volume of sources", not validity, deductive reasoning, or authority. It "chooses" to weigh the answer with more search results as correct (and is limited to returning 3 results - best of three?).
I wonder if it would help if they could somehow expose the context and allow the user to modify it or apply weights to different parts. I've certainly noticed that ChatGPT seems to sometimes simply forget what is going on, reply with the same answer that I've already rejected, etc.
If you scroll back a bit you'll see why they can say that -- folks are upvoting most replies. If they're not downvoting bad replies then... well, it's like people not voting and then complaining about their representatives.
> feedback on the answers generated by the new Bing has been mostly positive with 71% of you giving the AI-powered answers a “thumbs up.”
It doesn't at all say folks are upvoting most replies. It says 71% of users at some point gave it a thumbs up. It also says "entertainment" is a popular and unexpected use case.
As for citations specifically, this thing has been shown to make up citations and be adamant about gibberish being true. The whole accuracy/misinformation thing is kind of a big deal.