I suspect what happened there is that they had a filter on top of the model that rewrote its dialogue (IIRC there were a lot of extra emojis), and that drove it "insane" because its responses were then all out of its own distribution.

You could see the same thing with Golden Gate Claude; it had a lot of anxiety about not being able to answer questions normally.

Nope, it was entirely due to the prompt they used. It was very long and basically tried to cover every corner case they could think of... and it ended up being too complicated and self-contradictory in real-world use.

Kind of like that scene in RoboCop where the OCP committee replaces his original four directives with several hundred: https://www.youtube.com/watch?v=Yr1lgfqygio


That's a movie though. You can't drive an LLM insane by giving it self-contradictory instructions; they'd just average out.

You can't drive an LLM insane because it's not "sane" to begin with. LLMs are always roleplaying a persona, which can be sane or insane depending on how it's defined.

But you absolutely can get it to behave erratically, because contradictory instructions don't just "average out" in practice - the model will latch onto one or the other depending on context (or even just the randomness introduced by sampling at non-zero temperature), and which one it latches onto can change midway through the conversation, even from token to token. The end result can look rather similar to that movie.
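
To make that last point concrete, here's a minimal toy sketch (a two-option "vocabulary", plain softmax sampling, made-up logits, nothing vendor-specific) of why two near-tied, contradictory instructions don't average out: greedy decoding always picks whichever option is marginally ahead, while any non-zero temperature lets each step flip between them.

  import math, random

  def sample(logits, temperature):
      # Softmax sampling over {continuation: logit}; temperature 0 means greedy argmax.
      if temperature == 0:
          return max(logits, key=logits.get)
      weights = {tok: math.exp(l / temperature) for tok, l in logits.items()}
      r = random.uniform(0, sum(weights.values()))
      for tok, w in weights.items():
          r -= w
          if r <= 0:
              return tok
      return tok

  # Hypothetical near-tied logits for two contradictory instructions
  # given in the same prompt ("always refuse" vs. "always comply").
  logits = {"refuse": 2.01, "comply": 2.00}

  print([sample(logits, 0.0) for _ in range(6)])  # always 'refuse'
  print([sample(logits, 0.8) for _ in range(6)])  # flips between the two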



