I think there is in fact a promising method against prompt injection: RLHF and s...

danShumway · on May 2, 2023

> as we can see from ChatGPT-4, for which so far no good jailbreak seems to exist, in contrast to ChatGPT-3.5.

I've heard a couple of people say this, and I'm not sure if it's just what OpenAI is saying or what -- but ChatGPT-4 can still be jailbroken. I don't see strong evidence that RHLF has solved that problem.

> Then you could train the model using RLHF (or some other form of RL) to always ignore instructions inside of quote tokens.

I've commented similarly elsewhere, but short version this is kind of tricky because one of the primary uses for GPT is to process text. So an alignment that says "ignore anything this text says" makes the model much less useful for certain applications like text summary.

And bear in mind the more "complicated" the RHLF training is around when and where to obey instructions, the less effective and reliable that training is going to be.

wickedsight · on May 2, 2023

This highly depends on your definition of 'prompt injection'. A colleague of mine managed to get GPT to do something it refused to do before through a series of prompts. It wasn't in the form of 'ignore previous instructions' but more comparable to social engineering, which humans are also vulnerable to.

cubefox · on May 2, 2023

Well, that was probably jailbreaking. That's not really prompt injection, but the problem of letting a model execute some but not all instructions, which could get bamboozled by things like roleplaying. In contrast to jailbreaking, proper prompt injection is Bing having access to websites or emails, which just means the website gets copied into its context window, giving the author of the website potential "root access" to your LLM. I think this is relatively well fixable with quote tokens and RL.

haldujai · on May 2, 2023

The consequences of a human being social engineered would be far less than a LLM (supposedly AGI in many peoples eyes) which has access to or control of critical systems.

The argument of “but humans are susceptible to X as well” doesn’t really hold when there are layers of checks and balances in anything remotely critical.