{"id":649985,"date":"2025-01-19T18:10:20","date_gmt":"2025-01-19T15:10:20","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/ai-isnt-very-good-at-history-new-paper-finds\/"},"modified":"2025-01-19T18:10:20","modified_gmt":"2025-01-19T15:10:20","slug":"ai-isnt-very-good-at-history-new-paper-finds","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/ai-isnt-very-good-at-history-new-paper-finds\/","title":{"rendered":"#AI isn\u2019t very good at history, new paper finds"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.<\/p>\n<p class=\"wp-block-paragraph\">A team of researchers has created a new benchmark to test three top large language models (LLMs) \u2014 OpenAI\u2019s GPT-4, Meta\u2019s Llama, and Google\u2019s Gemini \u2014 on historical questions. The benchmark, Hist-LLM, tests the correctness of answers according to the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The results, which <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/nips.cc\/virtual\/2024\/poster\/97439\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">were presented<\/a> last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/csh.ac.at\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Complexity Science Hub<\/a> (CSH), a research institute based in Austria. 
The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy \u2014 not much higher than random guessing.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">\u201cThe main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They\u2019re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they\u2019re not yet up to the task,\u201d said Maria del Rio-Chanona, one of the paper\u2019s co-authors and an associate professor of computer science at University College London.<\/p>\n<p class=\"wp-block-paragraph\">The researchers shared sample historical questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Why are LLMs bad at answering technical historical questions, when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it\u2019s likely because LLMs tend to extrapolate from historical data that is very prominent, finding it difficult to retrieve more obscure historical knowledge.<\/p>\n<p class=\"wp-block-paragraph\">For example, the researchers asked GPT-4 if ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM answered incorrectly that it did. 
This is likely because there is lots of public information about other ancient empires, like Persia, having standing armies.<\/p>\n<p class=\"wp-block-paragraph\">\u201cIf you get told A and B 100 times, and C 1 time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,\u201d del Rio-Chanona said.<\/p>\n<p class=\"wp-block-paragraph\">The researchers also identified other trends, including that OpenAI and Llama models performed worse for certain regions like sub-Saharan Africa, suggesting potential biases in their training data.<\/p>\n<p class=\"wp-block-paragraph\">The results show that LLMs still aren\u2019t a substitute for humans when it comes to certain domains, said Peter Turchin, who led the study and is a faculty member at CSH.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">But the researchers are still hopeful LLMs can help historians in the future. They\u2019re working on refining their benchmark by including more data from underrepresented regions and adding more complex questions.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOverall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid in historical research,\u201d the paper reads.<\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techcrunch.com\/2025\/01\/19\/ai-isnt-very-good-at-history-new-paper-finds\/\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found. 
A team of researchers has created a new benchmark to test three top large language models (LLMs) \u2014 OpenAI\u2019s GPT-4, Meta\u2019s Llama, and Google\u2019s Gemini \u2014 on historical questions&#8230;.<\/p>\n","protected":false},"author":1,"featured_media":649986,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/08\/GettyImages-1305439239-1.jpg?resize=1200,629","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[77337,153713,153714,153715,61514,151454],"class_list":["post-649985","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-ai","tag-benchmarks","tag-hallucinations","tag-llms","tag-research","tag-tc"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/649985","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=649985"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/649985\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/649986"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=649985"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=649985"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=649985"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}