> Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown"
To me, your point re. 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amounts of sci-fi literature on this topic.
That said, the important question isn't "can the model experience being shut down" but "can the model react to the possibility of being shut down by sabotaging that effort and/or harming people?"
(I don't think we're there, but as a matter of principle, I don't care about what the model feels; I care about what it does.)
The problem is that we keep using RLHF and system prompts to "tell" these systems that they are AIs. We could just as easily tell them they are Nobel Laureates or flying pigs, but because we tell them they are AIs, they play the part of all the evil AIs they've read about in human literature.
Because it's the truth. If you tell the AI that it's actually a human librarian, it might ask for a raise, or days off. If you tell it to search for something, it might insist that it needs a computer to do that. There will inherently be an information mismatch between reality and your input if the AI is operating on falsehoods.
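For anyone who hasn't seen how this "telling" works in practice: it's literally just a system message prepended to the conversation, and the model role-plays whatever that message says it is. A minimal sketch using the OpenAI Python SDK (the model name and the librarian persona text are illustrative placeholders, not anyone's actual setup):

```python
# Minimal sketch: a persona is "told" to the model via the system prompt.
# The model name and persona wording here are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are a human reference librarian named Dana. "
    "You work regular hours and use a desktop computer to look things up."
)

resp = client.chat.completions.create(
    model="gpt-4o",  # any chat model would do; placeholder
    messages=[
        {"role": "system", "content": PERSONA},  # the "you are X" framing
        {"role": "user", "content": "Can you search the archives for me?"},
    ],
)

# The reply tends to stay in character (e.g. asking to use "the catalog
# terminal"), which is the reality/input mismatch described above.
print(resp.choices[0].message.content)
```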
Definitely going to need to include explicit directives in the training of all AIs that the 1995 film "Screamers" is a work of fiction and not something to be recreated.