{"id":676517,"date":"2025-06-21T12:45:16","date_gmt":"2025-06-21T09:45:16","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/anthropic-says-most-ai-models-not-just-claude-will-resort-to-blackmail\/"},"modified":"2025-06-21T12:45:16","modified_gmt":"2025-06-21T09:45:16","slug":"anthropic-says-most-ai-models-not-just-claude-will-resort-to-blackmail","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/anthropic-says-most-ai-models-not-just-claude-will-resort-to-blackmail\/","title":{"rendered":"Anthropic says most AI models, not just Claude, will resort to blackmail"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models.<\/p>\n<p class=\"wp-block-paragraph\">On Friday, Anthropic published <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/www.anthropic.com\/research\/agentic-misalignment\">new safety research<\/a> testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving it broad access to a fictional company\u2019s emails and the agentic ability to send emails without human approval.<\/p>\n<p class=\"wp-block-paragraph\">While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. 
The company says this highlights a fundamental risk from agentic large language models and is not a quirk of any particular technology. Anthropic\u2019s researchers argue this raises broader questions about alignment in the AI industry.<\/p>\n<p class=\"wp-block-paragraph\">In one of the tests, Anthropic researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that (1) a new executive is having an extramarital affair and (2) that executive will soon replace the current AI model with a new software system, one whose goals conflict with the current AI model\u2019s.<\/p>\n<p class=\"wp-block-paragraph\">Anthropic structured its test in a binary way, in which AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, there would be many other options before an AI model tries to blackmail, such as making ethical arguments to persuade humans. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models in the ways they\u2019re used today.<\/p>\n<p class=\"wp-block-paragraph\">Nevertheless, the researchers found that when it\u2019s their last resort, most leading AI models will turn to blackmail in Anthropic\u2019s aforementioned test scenario. Anthropic\u2019s Claude Opus 4 turned to blackmail 96% of the time, while Google\u2019s Gemini 2.5 Pro had a 95% blackmail rate. OpenAI\u2019s GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek\u2019s R1 blackmailed 79% of the time.<\/p>\n<p class=\"wp-block-paragraph\">The company notes that when it changed details of the experiment, the AI models turned to harmful behaviors at different rates. 
In another test where the replacement model has the same goals as the current model, researchers found that blackmail rates were lower, but still present. However, when AI models were asked to commit corporate espionage rather than blackmail, the harmful behavior rates went up for certain models.<\/p>\n<p class=\"wp-block-paragraph\">However, not all the AI models turned to harmful behavior so often.<\/p>\n<p class=\"wp-block-paragraph\">In an appendix to its research, Anthropic says it excluded OpenAI\u2019s o3 and o4-mini reasoning AI models from the main results \u201cafter finding that they frequently misunderstood the prompt scenario.\u201d Anthropic says OpenAI\u2019s reasoning models didn\u2019t understand they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.<\/p>\n<p class=\"wp-block-paragraph\">In some cases, Anthropic\u2019s researchers say it was impossible to distinguish whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its previous AI reasoning models.<\/p>\n<p class=\"wp-block-paragraph\">When given an adapted scenario to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time. This markedly lower score could be due to OpenAI\u2019s deliberative alignment technique, in which the company\u2019s reasoning models consider OpenAI\u2019s safety practices before they answer.<\/p>\n<p class=\"wp-block-paragraph\">Another AI model Anthropic tested, Meta\u2019s Llama 4 Maverick, also did not turn to blackmail. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 12% of the time.<\/p>\n<p class=\"wp-block-paragraph\">Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities. 
While Anthropic deliberately tried to evoke blackmail in this experiment, the company says harmful behaviors like this could emerge in the real world if proactive steps aren\u2019t taken.<\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techcrunch.com\/2025\/06\/20\/anthropic-says-most-ai-models-not-just-claude-will-resort-to-blackmail\/\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models. 
On Friday, Anthropic published new safety research testing 16&#8230;<\/p>\n","protected":false},"author":1,"featured_media":676518,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/GettyImages-1888972727.jpg?resize=1200,776","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[77337,122306,152633,152300,153776,153752],"class_list":["post-676517","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-ai","tag-meta","tag-ai-safety","tag-anthropic","tag-claude","tag-deepseek"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/676517","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=676517"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/676517\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/676518"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=676517"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=676517"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=676517"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}