I have beef with Lakera AI specifically -- Lakera AI has never produced a public demo that achieves a 100% defense rate against prompt injection. Lakera has launched a "game" that it uses to harvest data for training its own models, but that game has never blocked 100% of attacks, and it does not come close to covering every possible attack.
If Lakera AI had a defense for this, the company would be able to prove it. If you had a working, 100% effective method for blocking injections, there would be an unbeatable level in the game. But you don't have one, so the game doesn't have a level like that.
Lakera AI is selling a probabilistic defense, but the company's marketing makes it sound like something more reliable is going on. No one has ever demonstrated a fully reliable detector, and no one has a surefire method for defending against all prompt injections. I genuinely consider it deceptive that Lakera AI regularly leaves that fact out of its marketing.
The post above is wrong -- there is no 100% reliable way to catch this particular attack with an injection detector. What you should say is that at Lakera AI you have an injection detector that catches this attack some of the time. But that's not how Lakera phrases its marketing. The company is trying to discreetly sell people on the idea of a product that does not exist and that researchers have not demonstrated is even possible to build.
Sorry, where is Lakera claiming a 100% success rate against an ever-changing attack?
Of course, it's a known fact among technical experts in this area that an impassable defense against any attack of this nature is impossible.
> Sorry, where is Lakera claiming a 100% success rate against an ever-changing attack?
In any context other than prompt injection, nearly everyone would interpret the following sentence as meaning Lakera's product will always catch this attack:
> We at Lakera AI work on a prompt injection detector that actually catches this particular attack.
If we were talking about SQL injections, and someone posted that prepared statements catch SQL injections, we would not expect them to be referring to a probabilistic solution -- prepared statements block the attack by construction, as in the sketch below. You could argue that the context is the giveaway, but honestly I disagree.
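To make the contrast concrete, here's a minimal Python sketch (the table and input are made up for illustration). A prepared statement is a structural guarantee, not a detector:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    user_input = "x'; DROP TABLE users; --"  # classic injection payload

    # Vulnerable version: attacker-controlled text is spliced into the
    # query string, so the payload gets parsed as SQL.
    #   query = f"SELECT * FROM users WHERE name = '{user_input}'"

    # Prepared statement: the input is bound as data and is never parsed
    # as SQL, no matter what it contains. This blocks 100% of injections
    # by construction -- no probability involved.
    rows = conn.execute(
        "SELECT * FROM users WHERE name = ?", (user_input,)
    ).fetchall()

There is no equivalent construction for prompt injection: a prompt has no formal grammar that lets you cleanly separate instructions from data.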
I think this statement is very far off the mark:

> Of course, it's a known fact among technical experts in this area that an impassable defense against any attack of this nature is impossible.
I don't think I've ever seen a thread on HN about prompt injection that hasn't had people arguing that it's either easy to solve or can be solved through chained outputs/inputs, or that it's not a serious vulnerability. There are people building things with LLMs today who don't know anything about this. There are people launching companies off of LLMs who don't know anything about prompt injection. The experts know, but very few of the people in this space are experts. Ask Simon how many product founders he's had to talk to on Twitter after they've written breathless threads where they discover for the first time that system prompts can be leaked by current models.
So the non-experts that are launching products discover prompt injection, and then Lakera swoops in and says they have a solution. Sure, they don't outright say that the solution is 100% effective. But they also don't go out of their way to say that it isn't, and people's instincts about how security works fill in the gaps.
People don't have the context or experience to know that Lakera's "solution" is actually a probabilistic model and should not be relied on for serious security purposes. In fact, Lakera's product would be insufficient for Google to use in this exact situation. It's not appropriate for Lakera to recommend its own product for a use case it shouldn't be used for. And I do read their comment as suggesting that Lakera AI's product is applicable to this specific Bard attack.
Should we be comfortable with a company coming into a thread about a security vulnerability and pitching a product that is not intended to be used for that class of security vulnerability? I think the responsible thing for them to do is at least point out that their product is intended to address a different kind of problem entirely.
A probabilistic external classifier is not sufficient to defend against data exfiltration and should not be advertised as a tool to guard against data exfiltration. It should only be advertised for attacks where a 100% defense is not a requirement -- tasks like moderation, anti-spam, and abuse detection. But I don't think most readers know that about injection classifiers, and I don't think Lakera AI is particularly eager to get people to understand it. For a company that has gone to great lengths to teach people about the dangers of prompt injection in general, that educational effort stops when it gets to the most important fact about prompt injection: that we do not (as of now) know how to securely and reliably defend against it.
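To put rough numbers on why a probabilistic detector is not a hard security boundary (the catch rate here is hypothetical): a classifier that catches 99% of injection attempts sounds strong, but an attacker who can simply retry faces compounding odds in their favor:

    # Hypothetical classifier with a 99% per-attempt catch rate, versus
    # an attacker who tries again after each blocked attempt.
    catch_rate = 0.99
    for attempts in (1, 10, 100, 1000):
        p_bypass = 1 - catch_rate ** attempts
        print(f"{attempts:>4} attempts -> {p_bypass:.1%} chance of at least one bypass")

At 100 attempts that's roughly a 63% chance of a successful bypass. Those odds are fine for spam filtering; they are not fine when a single bypass means exfiltrated data.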
On your first point, I must disagree. The word “prevent” would be used to indicate 100%, well, prevention. You “catch” something you’re hunting for, and hunts aren’t always successful. A spam filter “catches” spam; nobody expects it to catch 100% of spam.