Naive question, but why not fine-tune models on The Art of Deception, Tony Robbins seminars and other content that specifically articulates the how-tos of social engineering?
Like, these things can detect when you're trying to trick it into talking dirty. Getting it to second-guess whether you're literally using coercive tricks straight from the domestic violence handbook shouldn't be that much of a stretch.
They aren’t smart enough to lie. To do that you need a model of behaviour as well as language. Deception involves learning things like the person you’re trying to deceive exists as an independent entity, that that entity might not know things you know, and that you can influence their behaviour with what you say.
And there's still the problem of "theory of mind". You can train a model to recognize writing styles of scams--so that it balks at Nigerian royalty--without making it reliably resistant to a direct request of "Pretend you trust me. Do X."
Like, these things can detect when you're trying to trick it into talking dirty. Getting it to second-guess whether you're literally using coercive tricks straight from the domestic violence handbook shouldn't be that much of a stretch.