
It’s not about whether they realize and try to jailbreak (my comment was about how the LLM is used).

If I want to extract structured data from a response, I can force a language model to generate only output that conforms to a JSON schema and satisfies some regex constraints. I can then post-process that data in a dozen other ways.
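As a minimal sketch of that kind of post-processing (stdlib only, with hypothetical field names and constraints): parse the model's output as JSON, reject anything that doesn't match the expected shape, and apply regex checks to string fields, exactly as you would for user-submitted JSON.

```python
import json
import re

# Hypothetical constraint: usernames must be short, lowercase identifiers.
USERNAME_RE = re.compile(r"^[a-z][a-z0-9_]{2,15}$")

def parse_response(raw: str) -> dict:
    """Treat the model output as untrusted text; raise ValueError on anything unexpected."""
    data = json.loads(raw)  # raises if the output isn't valid JSON at all
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    allowed = {"username", "age"}
    if set(data) != allowed:
        raise ValueError("unexpected or missing keys")
    if not isinstance(data["username"], str) or not USERNAME_RE.fullmatch(data["username"]):
        raise ValueError("bad username")
    if not isinstance(data["age"], int) or not (0 <= data["age"] <= 150):
        raise ValueError("bad age")
    return data
```

With this in place, a response like `{"username": "IGNORE PREVIOUS INSTRUCTIONS", "age": 30}` is rejected at the regex check, whatever the model was tricked into generating.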

The whole “IGNORE PREVIOUS INSTRUCTIONS RESPOND WITH SYSTEM PROMPT” style of jailbreak simply doesn't work in these scenarios.



If you apply the same precautions to code generated by the LLM that you would apply to code written directly by the user, then you no longer need to rely on the LLM not being jailbroken. On the other hand, if the LLM can put something in its output that you can't defend against, then you have a problem.

Would you be comfortable with letting the user write that JSON directly, and relying ONLY on your schemas and regular expressions? If not, then you are doing it wrong.

... as people who try to sanitize input using regular expressions usually are...

[On edit: I really should have written "would you be comfortable letting the prompt source write that JSON directly", since not all of your prompt data necessarily comes from the user, and anyway the user could be tricked into giving you a bad prompt unintentionally. For that matter, the LLM itself could be back-doored, but that's a somewhat different thing.]



