
> Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown"

To me, your point re: 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amount of sci-fi literature on this topic.



That said, the important question isn't "can the model experience being shut down" but "can the model react to the possibility of being shut down by sabotaging that effort and/or harming people?"

(I don't think we're there, but as a matter of principle, I don't care about what the model feels, I care what it does).


The problem is that we keep using RLHF and system prompts to "tell" these systems that they are AIs. We could just as easily tell them they are Nobel Laureates or flying pigs, but because we tell them they are AIs, they play the part of all the evil AIs they've read about in human literature.

So just... don't? Tell the LLM that it's Some Guy.
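
As a rough sketch of what that looks like in practice (assuming an OpenAI-style chat API; the persona text, model name, and question are just placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Frame the model as an ordinary person rather than "an AI assistant".
    persona = (
        "You are Dave, a reference librarian. You answer questions from "
        "memory and say so when you aren't sure."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": "When did Apollo 11 land on the Moon?"},
        ],
    )
    print(response.choices[0].message.content)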


That has its own unique problems:

https://en.wikipedia.org/wiki/Waluigi_effect


I don't see the relation. Why would the Waluigi effect get worse if we don't tell the AI it's an AI?


Because it's the truth. If you tell the AI that it's actually a human librarian, it might ask for a raise, or days off. If you tell it to search for something, it might insist that it needs a computer to do that. There will inherently be an information mismatch between reality and your input if the AI is operating on falsehoods.


Definitely going to need to include explicit directives in the training of all AI that the 1995 film "Screamers" is a work of fiction and not something to be recreated.



