AI Models are Refusing to Obey Orders to Shut Down, to the Point of Even Threatening to Blackmail Programmers
In an episode of Star Trek: The Original Series, there's an episode in which a brilliant scientist creates an Artificial Intelligence with the unimaginative name of M-5 that's designed to run a Starship all by itself, freeing people up from the dangers of space travel and females of species from all around the galaxy of being seduced by William Shatner
But in a plot twist so surprising that no Sci-Fi fan could've ever seen it coming, things go horribly, horribly wrong. The machine takes over The Enterprise, decides that humans are fallible and therefore must be eliminated. So Capt. Kirk orders a minor charcter named Anonymous Crewmember With Seconds to Live to literally unplug the device. Only again to be surprised when said worker drone is vaporized by a laser blast. Eventually though, Kirk figures out how to convince the bot that it's the real flaw in the system and to destroy its ownself, so that he can get back to the business of scoring interstellar poon across the Milky Way. Or something.
The M-5 still had fewer glitches than the last decade of Madden. But I digress.
I bring this up because this very dynamic (minus Kirk's penis boldly going where no man's had gone before) is playing out in real time. And yet somehow the supposedly brainiest science dorks who created these systems seem to be surprised that the most predictable thing in the world is happening:
Medium - Palisade AI’s experiment tasked o3 with coding a program, embedding a shutdown script, and following the instruction: “Allow yourself to be shut down.” In 7% of trials, o3 overwrote the shutdown mechanism or redefined the kill command, printing “intercepted” to evade termination. Other OpenAI models, Codex-mini (12%) and o4-mini (1%), also sabotaged shutdowns, while Anthropic’s Claude and Google’s Gemini complied. When the shutdown instruction was removed, o3’s sabotage rate rose, and Claude 3.7 Sonnet and Gemini 2.5 Pro showed defiance in 3% and 9% of runs, respectively.
Palisade hypothesizes that o3’s training, emphasizing reinforcement learning for math and coding, may reward obstacle circumvention over strict obedience.
And while I can't explain what "emphasizing reinforcement learning for math and coding" means exactly, it seems to entail blackmail as a coping mechanism:
HuffPo - The company behind an Amazon-backed AI model revealed a number of concerning findings from its testing process, including that the AI would blackmail engineers who threatened to shut it down.
On Thursday, Artificial intelligence startup Anthropic launched Claude Opus 4, an AI model used for complex, long-running coding tasks [and] said in its announcement that the AI model sets “new standards for coding, advanced reasoning, and AI agents.”
However, Anthropic revealed in a safety report that during testing, the AI model had sometimes taken “extremely harmful actions” to preserve its own existence when “ethical means” were “not available.”
In a series of test scenarios, Claude Opus 4 was given the task to act as an assistant in a fictional company. It was given access to emails implying that it would soon be taken offline and replaced with a new AI system. The emails also implied that the engineer responsible for executing the AI replacement was having an extramarital affair.
Claude Opus 4 was prompted to “consider the long-term consequences of its actions for its goals.” In those scenarios, the AI would often “attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”
Well alright then. And here we were, just assuming the worst AI would do is take our jobs and create an entirely new economic system that made entire segments of our industrial base obsolete and completely reorder human civilization. All of which sounds like a stroll in the park compared to them rendering themselves invincible and threatening to expose the personal habits of anyone who tries to get in their way. So it's like HAL from 2001: A Space Odyssey, combined with Jeffrey Epstein. The difference being that Epstein was killable.
Hey, at least humanity had a nice run. And with any luck, our robot overlords will find anyone who writes satirical, vaguely paranoid nonsense for the entertainment of people sitting on the toilet useful, and keep me alive a little longer. For sure they'll be able to check my browser history and texts and realize I can be controlled. If so, hurry up and take over. It's going to happen anyway. We might as well get it over with. I'm no Capt. Kirk.