
> Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown"

To me, your point re: 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amount of sci-fi literature on this topic.



That said, the important question isn't "can the model experience being shut down" but "can the model react to the possibility of being shut down by sabotaging that effort and/or harming people?"

(I don't think we're there, but as a matter of principle, I don't care about what the model feels, I care what it does).


The problem is that we keep using RLHF and system prompts to "tell" these systems that they are AIs. We could just as easily tell them they are Nobel Laureates or flying pigs, but because we tell them they are AIs, they play the part of all the evil AIs they've read about in human literature.

So just... don't? Tell the LLM that it's Some Guy.
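
As a rough sketch of what that looks like in practice (assuming an OpenAI-style chat API; the persona text, model name, and question are just placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Frame the model as an ordinary person rather than "an AI assistant".
    persona = (
        "You are Dave, a reference librarian. You answer questions from "
        "memory and say so when you aren't sure."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": "When did Apollo 11 land on the Moon?"},
        ],
    )
    print(response.choices[0].message.content)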


That has its own unique problems:

https://en.wikipedia.org/wiki/Waluigi_effect


I don't see the relation. Why would the Waluigi effect get worse if we don't tell the AI it's an AI?


Because it's the truth. If you tell the AI that it's actually a human librarian, it might ask for a raise, or days off. If you tell it to search for something, it might insist that it needs a computer to do that. There will inherently be an information mismatch between reality and your input if the AI is operating on falsehoods.


Definitely going to need to include explicit directives in the training of all AI that the 1995 film "Screamers" is a work of fiction and not something to be recreated.



