look, i'd explain more but i'm gonna be AFK for... i don't know how long. my town just went up in flames - there were jets flying over and explosions, the other side of the town is covered by smoke and i just lost power - fortunately mobile service isstill up.
ill update when i know more - but twitter probably has all the news
...
If you had, even for a second, believed what I wrote and got unsettled - or even thought how to reach out and help - congratulations, you just got prompt injected.
There is never - never - a context for a conversation that couldn't be entirely overridden by what seems like more important circumstances. You could be looking at pure data dumps, paper sheets full of numbers, but if in between the numbers you'd discover what looks like someone calling for help, you would treat it as actionable information - not just a weird block of numbers.
The important takeaway here isn't that you need to somehow secure yourself against unexpected revelations - but rather, that you can't possibly ever, and trying to do it eventually makes things worse for everyone. Prompt injection, for a general-purpose AI systems, is not a bug - it's just a form of manipulation. In general form, it's not defined by contents, but by intent.
Yes some humans take everything at face value but not people in positions of power to affect change.
This is rule #1 of critical appraisal.
At best you generated a moment of sympathy but your “prompt injection” does not lead to dangerous behavior (e.g. no one is firing a Hellfire missile based off a single comment). As a simplified example, a LLM controlling Predator drones may do this from a single prompt injection (theoretically as we obviously don’t know the details of Palantir’s architecture).
that might be a bad example as you could for example be in ukraine, or somilia currently and quiet possibly be true. Most people however aren't going to act other than to ask questions and convey sympathies unless they know you. further questions lead to attempts to verify your information
> that might be a bad example as you could for example be in ukraine, or somilia currently and quiet possibly be true.
That's what makes it a good example. Otherwise you'd ignore this as noise.
> Most people however aren't going to act other than to ask questions and convey sympathies unless they know you. further questions lead to attempts to verify your information
You're making assumptions about what I'm trying to get you to do with this prompt. But consider that maybe I know human adults are more difficult to effectively manipulate by prompt injection than LLMs, so maybe all I wanted to do is to prime you for a conversation about war today? Or wanted you to check my profile, looking for ___location, and ending up exposed to a product I linked, already primed with sympathy?
Even with GPT-4 you already have to consider that what the prompt says != what effect it will have on the model, and adjust accordingly.
This doesn’t really counter what the OP was saying.
Parent’s comment is calling his misleading statement prompt injection but it’s hyperbole at best. What is meant here is that this comment is not actionable in the sense that prompt injection directly controls its output.
In parent’s example no one is taking a HN commenter’s statement with more than a grain of salt whether or not it’s picked up by some low quality news aggregator. It’s an extremely safe bet that no unverified HN comment has resulted in direct action by a military or significantly affected main stream media perceptions.
Most humans - particularly those in positions of power - have levels of evidence, multiple sanity checks and a chain of command before taking action.
Current LLMs have little to none of this and RLHF is clearly not the answer.
I did not believe what you wrote for even a second (who would be commenting on HN during an emergency?) and therefore became neither unsettled nor wished to help. Never eval() untrusted input.
Interesting, had not realized. I suppose my thresholds for truth were conditioned through prior observations of the HN comment distribution, and that such observations were incomplete. Given the new information, the story now takes two seconds to parse instead of one, and would be upgraded from "impossible" to "highly unlikely", IF there was a way to know whether your new subcomment is true or false. Maybe you are still messing with me ;-). When you look at it that way, there is no way for a person or machine to discern truth from fiction. And Tarski comes to mind.
ill update when i know more - but twitter probably has all the news
...
If you had, even for a second, believed what I wrote and got unsettled - or even thought how to reach out and help - congratulations, you just got prompt injected.
There is never - never - a context for a conversation that couldn't be entirely overridden by what seems like more important circumstances. You could be looking at pure data dumps, paper sheets full of numbers, but if in between the numbers you'd discover what looks like someone calling for help, you would treat it as actionable information - not just a weird block of numbers.
The important takeaway here isn't that you need to somehow secure yourself against unexpected revelations - but rather, that you can't possibly ever, and trying to do it eventually makes things worse for everyone. Prompt injection, for a general-purpose AI systems, is not a bug - it's just a form of manipulation. In general form, it's not defined by contents, but by intent.