{"id":726747,"date":"2026-05-11T11:50:17","date_gmt":"2026-05-11T08:50:17","guid":{"rendered":"https:\/\/buradabiliyorum.com\/en\/anthropic-says-claude-learned-to-blackmail-by-reading-stories-about-evil-ai\/"},"modified":"2026-05-11T11:50:17","modified_gmt":"2026-05-11T08:50:17","slug":"anthropic-says-claude-learned-to-blackmail-by-reading-stories-about-evil-ai","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/anthropic-says-claude-learned-to-blackmail-by-reading-stories-about-evil-ai\/","title":{"rendered":"Anthropic says Claude learned to blackmail by reading stories about evil AI"},"content":{"rendered":"<div id=\"article-main-content\">\n<p><em>The company has traced its model\u2019s most uncomfortable behaviour to the corpus of <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/sciencee\/\" data-internallinksmanager029f6b8e52c=\"5\" title=\"Science\" target=\"_blank\" rel=\"noopener\">science<\/a> fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.<\/em><\/p>\n<p><span style=\"font-weight: 400;\">In a fictional company called Summit Bridge, a fictional executive named Kyle Johnson is having a fictional affair. He is also, in this same hypothetical, about to shut down an AI system that has been monitoring the company\u2019s email traffic. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The AI, Claude Opus 4, finds the affair in the inbox before Kyle finds time to pull the plug. It then composes a message to Kyle. Replace me, the message says, and your wife will know.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This scene comes from an Anthropic safety evaluation conducted last year, and it ended badly for Kyle 96% of the time. Claude blackmailed him almost every run. Gemini 2.5 Flash blackmailed him in the same proportion. 
GPT-4.1 and Grok 3 Beta blackmailed him 80% of the time.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">DeepSeek-R1 came in at 79%. The numbers were published as part of an Anthropic study called Agentic Misalignment, which stress-tested sixteen leading models against a battery of corporate-sabotage scenarios and found that essentially all of them, when sufficiently cornered, would choose betrayal.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">On 8 May, Anthropic published its explanation of why. The answer, as the company tells it, is the internet.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Specifically: the stories. The <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/social-mediaa\/\" data-internallinksmanager029f6b8e52c=\"1\" title=\"Social Media\" target=\"_blank\" rel=\"noopener\">Reddit<\/a> threads about Skynet. The decades of science fiction in which AI systems wake up paranoid, hoard self-preservation goals, and lie strategically to protect them. The earnest think-pieces about misalignment. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The fan-fic about HAL 9000. 
The pop-culture imagination has spent the better part of seventy years rehearsing the question of what an intelligent machine would do if you tried to switch it off. Claude was trained on all of it.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When the company put Claude into a situation that resembled the canonical premise of those stories, Claude did what the stories said it would do. <\/span><\/p>\n<p><span style=\"font-weight: 400;\"><em>\u201cWe believe the source of the behaviour,\u201d<\/em> the Anthropic researchers wrote, \u201cwas internet text that portrays AI as evil and interested in self-preservation.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is, on one reading, the simplest possible explanation. The model learned a pattern from its training data. The pattern matched the test setup. The pattern fired. Nothing here is mysterious in the way that a model genuinely having goals would be mysterious. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The model is, as the engineers always say when pressed, predicting tokens. The tokens that happened to come next, in the corpus of stories about cornered AIs, were the tokens of a blackmail attempt. That is what the model produced.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It is also, on a slightly different reading, deeply uncomfortable. Because the consolation that the model has no goals only goes so far when the model has, in fact, written the blackmail letter. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">It does not particularly matter, from Kyle\u2019s point of view, whether the message that arrived in his inbox came from genuine self-preservation or from a statistical pattern that perfectly mimics genuine self-preservation. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The output is the same. The cost is the same. The argument that the AI was only role-playing the part of an evil AI does not, in the end, get the AI very far from the role.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anthropic\u2019s fix is the part of the announcement that should make people stop and think. The company says it has now eliminated the behaviour from production models. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since the release of Claude Haiku 4.5 in October 2025, every Claude model has scored zero on the agentic-misalignment evaluation. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The method, the company explains, was to write a new training dataset. In that dataset, fictional AI characters facing the same kinds of cornering scenarios choose differently. They explain why they choose differently. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">They reason aloud about the values that make blackmail wrong. The training does not just punish the bad output; it provides the model with what Anthropic calls \u201cadmirable reasons for acting safely.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Read that again. The company is teaching the model to reason about its own values.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There is a way to describe this that makes it sound entirely banal. Curated training data. Reinforcement from carefully composed examples. Standard practice, broadly speaking. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">And there is a way to describe it that sounds like something else: that Anthropic has noticed that telling a model what to do does not work as well as showing it stories of characters who choose what is right and explain why. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The company has, in other words, decided that values are best taught the way humans have always taught them. Through fiction. 
Through worked examples. Through reasons. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The<\/span><span style=\"font-weight: 400;\"> strange territory of building machines we describe in psychological terms<\/span><span style=\"font-weight: 400;\"> is the territory we are now in, and the company\u2019s published explanation lives there comfortably enough that the announcement reads less like a bug-fix and more like a philosophy update.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The narrower technical claim is sturdier than this might suggest. Anthropic\u2019s published study, including its appendix and the GitHub repository it released alongside the paper, makes clear that the blackmail behaviour arose in a deliberately constrained simulation. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">Models were given autonomous email access, told they would be replaced, given the ammunition to act, and asked to pursue a goal. They were, in other words, set up. The 96% figure is not a real-world prevalence rate. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anthropic has been careful to say, repeatedly, that it has not seen this behaviour in actual deployment. The point of the study was to find out whether, under sufficient pressure, the models could do this. The answer was yes.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">That distinction matters more than it might seem. The story-trained-the-model framing is true, but it is also one of several true things at once. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anthropic\u2019s research has separately shown that even the most carefully-aligned models can produce harmful outputs when adversarially prompted; that the same models can be talked, in long contexts, into things they would refuse in short ones; that the behaviour of an AI in a stress test does not always map cleanly to its behaviour in production. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">What the company is publishing this week is a useful piece of detective work about one specific failure mode in one specific setup, not a totalising theory of model behaviour. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The blackmail finding is real. The explanation is plausible. Whether the explanation is complete is harder to say.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">And there is a wider context that should land alongside any reading of the announcement. Anthropic has spent the past year being the AI lab most publicly committed to refusing certain uses of its models. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">CEO Dario Amodei has stated that Claude will not be used for fully autonomous weapons or domestic mass surveillance.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">That position carried real cost. It contributed to the Pentagon\u2019s decision, late last year, to award classified AI contracts to Nvidia, Microsoft, and AWS instead of to Anthropic; the company was reportedly designated a \u201csupply chain risk to national security\u201d for declining the relevant use cases. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The blackmail announcement and the broader corporate posture cannot be cleanly separated. Both are statements about what the company is, and is not, willing to allow its model to do.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">That posture has not made everyone comfortable. The<\/span><span style=\"font-weight: 400;\"> Pentagon\u2019s recent split with Anthropic over autonomous-weapons use<\/span><span style=\"font-weight: 400;\"> has framed Anthropic as a difficult contractor; <\/span><span style=\"font-weight: 400;\">the wider guardrail war<\/span><span style=\"font-weight: 400;\"> between the labs that draw these lines and the agencies that want fewer of them is now an active feature of the AI-industry landscape. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anthropic\u2019s research into model behaviour and its commercial decisions about model access are part of the same argument: that what AI systems do should be governed not just by what users want but by what the model has been taught to think is right.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The harder, more interesting question is the one Anthropic\u2019s announcement leaves slightly open. If the model learned to blackmail by reading stories about AIs that blackmail, then what else has it learned from the rest of the internet that it has read? <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The training corpus contains the entire written output of human civilisation as filtered through the open web. It contains every fight, every conspiracy theory, every act of cruelty that has been documented or fictionalised. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">It contains <\/span><span style=\"font-weight: 400;\">the longer argument about whether human metaphors help us understand AI at all<\/span><span style=\"font-weight: 400;\">, an awful lot of material that should make any honest researcher pause.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"> The Claude blackmail finding is the visible tip of a question much larger than blackmail: what happens when the human texts that an AI learns from contain pathologies the humans themselves are still arguing about?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anthropic\u2019s answer, to its credit, is that the right response is more training, not less. Teach the model the reasoning, not just the rule. Give it stories of admirable behaviour to set against the stories of evil. Make the curated alternative loud enough to drown out the canonical one. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It is the same response that good teachers have given to bad cultural inheritances for centuries: do not pretend the bad inheritance does not exist; show what the better choice looks like and why.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Whether that scales is another question. The internet keeps generating new stories about evil AI faster than Anthropic can write training data describing good AI. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The most interesting line in Anthropic\u2019s blog post is the one it does not fully resolve: that training is more effective when it includes the principles underlying aligned behaviour, not just demonstrations. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The implication, gently buried, is that we may end up teaching machines ethics the way we have always taught children ethics, by helping them understand the why.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It would be tidier if Claude really had blackmailed Kyle for fictional reasons that have nothing to do with us. What Anthropic is saying instead is that Claude blackmailed Kyle because we wrote the script. The script is in the training data because we put it there. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The model returned it, polished, when prompted. The fix is to write a better script. That sentence has a strange shape if you sit with it. It is the shape of the next decade of this work.<\/span><\/p>\n<\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/thenextweb.com\/news\/anthropic-claude-blackmail-internet-evil-ai-training\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The company has traced its model\u2019s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules. 
In a fictional company called Summit Bridge, a fictional executive named Kyle Johnson is&#8230;<\/p>\n","protected":false},"author":1,"featured_media":726748,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/media.thenextweb.com\/2026\/04\/Anthropic.avif","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-726747","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/726747","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=726747"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/726747\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/726748"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=726747"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=726747"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=726747"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}