{"id":662700,"date":"2025-04-15T01:51:45","date_gmt":"2025-04-14T22:51:45","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/debates-over-ai-benchmarking-have-reached-pokemon\/"},"modified":"2025-04-15T01:51:45","modified_gmt":"2025-04-14T22:51:45","slug":"debates-over-ai-benchmarking-have-reached-pokemon","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/debates-over-ai-benchmarking-have-reached-pokemon\/","title":{"rendered":"Debates over AI benchmarking have reached Pok\u00e9mon"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Not even Pok\u00e9mon is safe from AI benchmarking controversy. <\/p>\n<p class=\"wp-block-paragraph\">Last week, a <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/x.com\/Jush21e8\/status\/1910293595422413051\">post on X<\/a> went viral, claiming that Google\u2019s latest Gemini model surpassed Anthropic\u2019s flagship Claude model in the original Pok\u00e9mon video <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/game\/\" data-internallinksmanager029f6b8e52c=\"7\" title=\"Game\" target=\"_blank\" rel=\"noopener\">game<\/a> trilogy. 
Reportedly, Gemini had reached Lavender Town in a developer\u2019s Twitch stream; Claude was stuck at Mount Moon as of late February.<\/p>\n<blockquote class=\"wp-block-quote twitter-tweet is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town<\/p>\n<p class=\"wp-block-paragraph\">119 live views only btw, incredibly underrated stream <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/t.co\/8AvSovAI4x\">pic.twitter.com\/8AvSovAI4x<\/a><\/p>\n<p class=\"wp-block-paragraph\">\u2014 Jush (@Jush21e8) <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/twitter.com\/Jush21e8\/status\/1910293595422413051?ref_src=twsrc%5Etfw\">April 10, 2025<\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">But what the post failed to mention is that Gemini had an advantage.<\/p>\n<p class=\"wp-block-paragraph\">As <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/www.reddit.com\/r\/singularity\/comments\/1jvwqc9\/gemini_plays_pok%C3%A9mon_has_made_it_through_rock\/\">users on Reddit<\/a> pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify \u201ctiles\u201d in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.<\/p>\n<p class=\"wp-block-paragraph\">Now, Pok\u00e9mon is a semi-serious AI benchmark at best \u2014 few would argue it\u2019s a very informative test of a model\u2019s capabilities. 
But it <em>is<\/em> an instructive example of how different implementations of a benchmark can influence the results.<\/p>\n<p class=\"wp-block-paragraph\">For example, Anthropic <a rel=\"nofollow\" target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/www.anthropic.com\/news\/claude-3-7-sonnet\">reported<\/a> two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model\u2019s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a \u201ccustom scaffold\u201d that Anthropic developed.<\/p>\n<p class=\"wp-block-paragraph\">More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.<\/p>\n<p class=\"wp-block-paragraph\">Given that AI benchmarks \u2014 Pok\u00e9mon included \u2014 are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn\u2019t seem likely that it\u2019ll get any easier to compare models as they\u2019re released.<\/p>\n<\/div>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techcrunch.com\/2025\/04\/14\/debates-over-ai-benchmarking-have-reached-pokemon\/\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Not even Pok\u00e9mon is safe from AI benchmarking controversy. Last week, a post on X went viral, claiming that Google\u2019s latest Gemini model surpassed Anthropic\u2019s flagship Claude model in the original Pok\u00e9mon video game trilogy. 
Reportedly, Gemini had reached Lavender Town in a developer\u2019s Twitch stream; Claude was stuck at Mount Moon as of late&#8230;<\/p>\n","protected":false},"author":1,"featured_media":662702,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/techcrunch.com\/wp-content\/uploads\/2019\/01\/pokemon.png?resize=1200,674","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[77337,153713,33597],"class_list":["post-662700","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-ai","tag-benchmarks","tag-pokemon"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/662700","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=662700"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/662700\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/662702"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=662700"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=662700"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=662700"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}