Hacker Newsnew | past | comments | ask | show | jobs | submit | saladtoes's commentslogin

https://www.lakera.ai/blog/claude-4-sonnet-a-new-standard-fo...

These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.


They gave a bullet point in that intro which I disagree with: "The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections."

I still don't see guardrails and scanning as effective ways to prevent malicious attackers. They can't get to 100% effective, at which point a sufficiently motivated attacker is going to find a way through.

I'm hoping someone implements a version of the CaMeL paper - that solution seems much more credible to me. https://simonwillison.net/2025/Apr/11/camel/


I only half understand CaMeL. Couldn't the prompt injection just happen at the stage where the P-LLM devises the plan for the other LLM such that it creates a different, malicious plan?

Or is it more about the user then having to confirm/verify certain actions and what is essentially a "permission system" for what the LLM can do?

My immediate thought is that that may be circumvented in a way where the user unknowingly thinks they are confirming something safe. Analogous to spam websites that show a fake "Allow Notifications" prompt that is rendered as part of the actual website body. If the P-LLM creates the plan it could make it arbitrarily complex and confusing for the user, allowing something malicious to happen.

Overall it's very good to see research in this area though (also seems very interesting and fun).


The idea is that the P-LLM is never exposed to interested data.


Agreed on CaMeL as a promising direction forward. Guardrails may not get 100% of the way but are key for defense in depth, even approached like CaMeL currently fall short for text to text attacks, or more e2e agentic systems.


What security measure, in any domain, is 100% effective?


Using parameters in your SQL query in place of string concatenation to avoid SQL injection.

Correctly escaping untrusted markup in your HTML to avoid XSS attacks.

Both of those are 100% effective... unless you make a mistake in applying those fixes.

That is why prompt injection is different: we do not know what the 100% reliable fixes for it are.


Fair point - "the only way to" is probably too strong a framing. But I think the core argument stands: while model-level safety improvements are valuable, they're not sufficient for securing real applications. Claude is clearly the safest model available right now, but it's still highly susceptible to indirect prompt injection attacks and remains practically unaligned when it comes to tool use. The safety work at the model level helps with direct adversarial prompts, but doesn't solve the fundamental architectural vulnerabilities that emerge when you connect these models to external data sources and tools - for now.


None; but, as mentioned in the post, 99% is considered a failing grade in application security.


I've been playing Gandalf in the last few days, it does a great job at giving an intuition for some of the subtleties of prompt engineering: https://gandalf.lakera.ai

Thanks for putting this together!


Whoa that was a lot of fun. Are you aware of any other games like this? A sort of CTF for AIs?


https://securitycafe.ro/2023/05/15/ai-hacking-games-jailbrea... showed up in my feed this morning but I haven't tried them to know if they're any fun

I also found the nondeterministic behavior of Gandalf robbed it of being "fun," to say nothing of the 429s (which they claim to have fixed but I was so burned by the experience I haven't bothered going back through the lower levels to find out)


Explainability is not a given in many more traditional complex systems. Decisions are often an aggregation of a large number of signals, and one can often not conceive of a single intuitive explanation for the system's decisions.

A lot is expected of AI systems today, from fairness (how do we even define that?) to universality. In my view we need to develop a practical understanding of what it means to build the system we have in mind: do I understand where I want my system to perform, and do I have the tools to assess whether I am getting there? Interpretability is orthogonal to all of this.

I would much rather have a well tested system, accompanied by online monitoring to detect unusual inputs in an ever-changing data distribution and notify when updates are needed or a human needs to take control, than an unreliable system that is great at providing explanations.


> I would much rather have

False dichotomy. In fact, well understood systems must be more reliable.


Is that really a universal fact? In any case, my statement goes in the opposite direction: is a reliable system necessarily "well understood", in the sense that it can explain its decisions? Most complex systems powering our lives cannot tell us anything about how they made those decisions.

AI systems add a layer of complexity. Even if you can explain a decision well on your training data, I seriously doubt you will be able to still provide reasonable explanations in completely out of distribution data.


Interesting, thanks for sharing! Somehow I'm not surprised. My experience building systems for real world applications is that choosing a simple CNN is usually the way to go, and it's all in the data. Choosing more complex model classes also comes at a cost, higher variance and more likelihood of issues related to robustness and generalization.


“The latter being when the training or test data follows a different distribution to the in-operation data”

This form of ML bug is the most challenging to catch. The true in-operation distribution is often unknown which makes testing for such bugs a very challenging problem. Any thoughts on this?


Thanks for your comment. The whole field of run-time monitoring is concerned with this problem. It's a tough one to crack when the distribution changes are subtle, but you can and should at least check simple data attributes for consistency.



"High-risk AI systems should bear the CE marking to indicate their conformity with this Regulation so that they can move freely within the Union"

That would be something!


Thanks for sharing!


Oh no!! I really wanted to know how they develop their AI so reliably :). This is a massive issue today. Hope we develop tools and processes to get us there soon.”


Thanks for reading and your comment! :-)


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: