{"id":662202,"date":"2025-04-11T13:20:20","date_gmt":"2025-04-11T10:20:20","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/ai-models-still-struggle-to-debug-software-microsoft-study-shows\/"},"modified":"2025-04-11T13:20:20","modified_gmt":"2025-04-11T10:20:20","slug":"ai-models-still-struggle-to-debug-software-microsoft-study-shows","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/ai-models-still-struggle-to-debug-software-microsoft-study-shows\/","title":{"rendered":"AI models still struggle to debug software, Microsoft study shows"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">AI models from OpenAI, Anthropic, and other top AI labs are increasingly being used to assist with programming tasks. Google CEO Sundar Pichai <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arstechnica.com\/ai\/2024\/10\/google-ceo-says-over-25-of-new-google-code-is-generated-by-ai\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">said in October<\/a> that 25% of new code at the company is generated by AI, and Meta CEO Mark Zuckerberg <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/x.com\/slow_developer\/status\/1877798620692422835?mx=2\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">has expressed ambitions<\/a> to widely deploy AI coding models within the social media giant.<\/p>\n<p class=\"wp-block-paragraph\">Yet even some of the best models today struggle to resolve software bugs that wouldn\u2019t trip up experienced devs.<\/p>\n<p class=\"wp-block-paragraph\">A <a rel=\"nofollow\" target=\"_blank\" 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/debug-gym-an-environment-for-ai-coding-tools-to-learn-how-to-debug-code-like-programmers\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">new study<\/a> from Microsoft Research, Microsoft\u2019s R&amp;D division, reveals that models, including Anthropic\u2019s Claude 3.7 Sonnet and OpenAI\u2019s o3-mini, fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.entrepreneur.com\/business-news\/anthropic-ceo-predicts-ai-will-take-over-coding-in-12-months\/488533\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">bold<\/a> <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/posts\/zainkahn_nvidias-ceo-says-ai-will-replace-software-activity-7191411433748783104-jCZN\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">pronouncements<\/a> <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/m.economictimes.com\/news\/new-updates\/openai-cpo-kevin-weil-writes-obituary-for-techies-predicts-ai-will-takeover-human-coders-by-end-of-this-year\/articleshow\/119096930.cms#:~:text=Kevin%20Weil%20from%20OpenAI%20predicts,making%20software%20creation%20more%20accessible.\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">from companies like OpenAI<\/a>, AI is still no match for human experts in domains such as coding.<\/p>\n<p class=\"wp-block-paragraph\">The study\u2019s co-authors tested nine different models as the backbone for a \u201csingle prompt-based agent\u201d that had access to a number of debugging tools, including a Python debugger. 
They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.<\/p>\n<p class=\"wp-block-paragraph\">According to the co-authors, even when equipped with stronger and more recent models, their agent rarely completed more than half of the debugging tasks successfully. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI\u2019s o1 (30.2%), and o3-mini (22.1%).<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1934\" height=\"974\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?w=680\" alt=\"Microsoft AI debugging benchmark\" class=\"wp-image-2992447\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png 1934w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=150,76 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=300,151 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=768,387 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=680,342 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=1200,604 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=1280,645 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=430,217 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=720,363 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=900,453 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=800,403 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=1536,774 1536w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=668,336 668w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=1225,617 1225w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/04\/DeBug_froggy_bar_chart.png?resize=708,357 708w\" sizes=\"auto, (max-width: 1934px) 100vw, 1934px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">A chart from the study. The \u201crelative increase\u201d refers to the boost models got from being equipped with debugging tooling.<\/span><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong>Microsoft<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Why the underwhelming performance? Some models struggled to use the debugging tools available to them and understand how different tools might help with different issues. The bigger problem, though, was data scarcity, according to the co-authors. They speculate that there\u2019s not enough data representing \u201csequential decision-making processes\u201d \u2014 that is, human debugging traces \u2014 in current models\u2019 training data. <\/p>\n<p class=\"wp-block-paragraph\">\u201cWe strongly believe that training or fine-tuning [models] can make them better interactive debuggers,\u201d wrote the co-authors in their study. \u201cHowever, this will require specialized data to fulfill such model training, for example, trajectory data that records agents interacting with a debugger to collect necessary information before suggesting a bug fix.\u201d<\/p>\n<p class=\"wp-block-paragraph\">The findings aren\u2019t exactly shocking. 
Many studies have <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.techrepublic.com\/article\/ai-generated-code-outages\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">shown<\/a> that code-generating AI tends to introduce security vulnerabilities and errors,\u00a0owing to weaknesses in areas like the ability to understand programming logic.\u00a0<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.theregister.com\/2025\/01\/23\/ai_developer_devin_poor_reviews\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">One recent evaluation of Devin<\/a>, a popular AI coding tool, found that it could only complete three out of 20 programming tests.<\/p>\n<p class=\"wp-block-paragraph\">But the Microsoft work is one of the more detailed looks yet at a persistent problem area for models. It likely won\u2019t dampen investor enthusiasm for AI-powered assistive coding tools, but with any luck, it\u2019ll make developers \u2014 and their higher-ups \u2014 think twice about letting AI run the coding show.<\/p>\n<p class=\"wp-block-paragraph\">For what it\u2019s worth, a growing number of tech leaders have disputed the notion that AI will automate away coding jobs. Microsoft co-founder Bill Gates <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.windowscentral.com\/software-apps\/sam-altman-ai-will-make-coders-10x-more-productive-not-replace-them\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">has said he thinks programming as a profession<\/a> is here to stay. 
So has <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.windowscentral.com\/software-apps\/work-productivity\/replit-ceo-ai-wont-replace-code-monkeys\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Replit CEO Amjad Masad<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.entrepreneur.com\/business-news\/okta-ceo-ai-will-lead-to-more-software-engineers-not-less\/489732\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Okta CEO Todd McKinnon<\/a>, and IBM CEO Arvind Krishna.<\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techcrunch.com\/2025\/04\/10\/ai-models-still-struggle-to-debug-software-microsoft-study-shows\/\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI models from OpenAI, Anthropic, and other top AI labs are increasingly being used to assist with programming tasks. 
Google CEO Sundar Pichai said in October that 25% of new code at the company is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to widely deploy AI coding models within the social&#8230;<\/p>\n","protected":false},"author":1,"featured_media":662203,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/07\/microsoft-logo-office.jpg?resize=1200,798","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[77337,70937,155543,70286,61514],"class_list":["post-662202","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-ai","tag-artificial-intelligence","tag-debugging","tag-microsoft","tag-research"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/662202","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=662202"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/662202\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/662203"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=662202"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=662202"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=662202"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}