{"id":658724,"date":"2025-03-25T11:25:16","date_gmt":"2025-03-25T08:25:16","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/a-new-challenging-agi-test-stumps-most-ai-models\/"},"modified":"2025-03-25T11:25:16","modified_gmt":"2025-03-25T08:25:16","slug":"a-new-challenging-agi-test-stumps-most-ai-models","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/a-new-challenging-agi-test-stumps-most-ai-models\/","title":{"rendered":"A new, challenging AGI test stumps most AI models"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher Fran\u00e7ois Chollet, announced in a <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/arcprize.org\/blog\/announcing-arc-agi-2-and-arc-prize-2025\">blog post<\/a> on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.<\/p>\n<p class=\"wp-block-paragraph\">So far, the new test, called ARC-AGI-2, has stumped most models.<\/p>\n<p class=\"wp-block-paragraph\">\u201cReasoning\u201d AI models like OpenAI\u2019s o1-pro and DeepSeek\u2019s R1 score between 1% and 1.3% on ARC-AGI-2, according to the <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/arcprize.org\/leaderboard\">Arc Prize leaderboard<\/a>. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.<\/p>\n<p class=\"wp-block-paragraph\">The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares, and generate the correct \u201canswer\u201d grid. The problems were designed to force an AI to adapt to new problems it hasn\u2019t seen before. 
<\/p>\n<p class=\"wp-block-paragraph\">The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, \u201cpanels\u201d of these people got 60% of the test\u2019s questions right \u2014 much better than any of the models\u2019 scores.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1624\" height=\"786\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?w=680\" alt=\"\" class=\"wp-image-2985527\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png 1624w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=150,73 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=300,145 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=768,372 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=680,329 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1200,581 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1280,620 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=430,208 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=720,348 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=900,436 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=800,387 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1536,743 1536w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=668,323 668w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=1275,617 1275w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.16.48PM.png?resize=708,343 708w\" sizes=\"auto, (max-width: 1624px) 100vw, 1624px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">A sample question from ARC-AGI-2 (credit: Arc Prize).<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In a <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/x.com\/fchollet\/status\/1904265979192086882\">post on X<\/a>, Chollet claimed ARC-AGI-2 is a better measure of an AI model\u2019s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation\u2019s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.<\/p>\n<p class=\"wp-block-paragraph\">Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on \u201cbrute force\u201d \u2014 extensive computing power \u2014 to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.<\/p>\n<p class=\"wp-block-paragraph\">To address the first test\u2019s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.<\/p>\n<p class=\"wp-block-paragraph\">\u201cIntelligence is not solely defined by the ability to solve problems or achieve high scores,\u201d Arc Prize Foundation co-founder Greg Kamradt wrote in a <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/arcprize.org\/blog\/announcing-arc-agi-2-and-arc-prize-2025\">blog post<\/a>. \u201cThe efficiency with which those capabilities are acquired and deployed is a crucial, defining component. 
The core question being asked is not just, \u2018Can AI acquire [the] skill to solve a task?\u2019 but also, \u2018At what efficiency or cost?\u2019\u201d<\/p>\n<p class=\"wp-block-paragraph\">ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3\u2019s performance gains on ARC-AGI-1 came with a hefty price tag.<\/p>\n<p class=\"wp-block-paragraph\">The version of OpenAI\u2019s o3 model \u2014 o3 (low) \u2014 that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1602\" height=\"902\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?w=680\" alt=\"\" class=\"wp-image-2985529\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png 1602w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=150,84 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=300,169 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=768,432 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=680,383 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1200,676 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1280,721 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=430,242 430w, 
https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=720,405 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=900,507 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=800,450 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1536,865 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=668,376 668w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=666,375 666w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=1096,617 1096w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-24-at-3.18.29PM.png?resize=708,399 708w\" sizes=\"auto, (max-width: 1602px) 100vw, 1602px\"\/><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">Comparison of Frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. 
Hugging Face\u2019s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.<\/p>\n<p class=\"wp-block-paragraph\">Alongside the new benchmark, the Arc Prize Foundation announced <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/arcprize.org\/competition\">a new Arc Prize 2025 contest<\/a>, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while only spending $0.42 per task.<\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techcrunch.com\/2025\/03\/24\/a-new-challenging-agi-test-stumps-most-ai-models\/\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher Fran\u00e7ois Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models. So far, the new test, called ARC-AGI-2, has stumped most models. 
\u201cReasoning\u201d AI models like OpenAI\u2019s o1-pro and&#8230;<\/p>\n","protected":false},"author":1,"featured_media":658725,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/03\/GettyImages-1708266672.jpg?resize=1200,1029","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[77337,151633],"class_list":["post-658724","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-ai","tag-agi"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/658724","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=658724"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/658724\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/658725"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=658724"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=658724"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=658724"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}