{"id":663740,"date":"2025-04-19T04:50:19","date_gmt":"2025-04-19T01:50:19","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/openais-new-reasoning-ai-models-hallucinate-more\/"},"modified":"2025-04-19T04:50:19","modified_gmt":"2025-04-19T01:50:19","slug":"openais-new-reasoning-ai-models-hallucinate-more","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/openais-new-reasoning-ai-models-hallucinate-more\/","title":{"rendered":"OpenAI&#8217;s new reasoning AI models hallucinate more"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">OpenAI\u2019s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up \u2014 in fact, they hallucinate <em>more<\/em> than several of OpenAI\u2019s older models.<\/p>\n<p class=\"wp-block-paragraph\">Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today\u2019s best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn\u2019t seem to be the case for o3 and o4-mini.<\/p>\n<p class=\"wp-block-paragraph\">According to OpenAI\u2019s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate <em>more often<\/em> than the company\u2019s previous reasoning models \u2014 o1, o1-mini, and o3-mini \u2014 as well as OpenAI\u2019s traditional, \u201cnon-reasoning\u201d models, such as GPT-4o.<\/p>\n<p class=\"wp-block-paragraph\">Perhaps more concerning, the ChatGPT maker doesn\u2019t really know why it\u2019s happening.
<\/p>\n<p class=\"wp-block-paragraph\">In its technical report for <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/cdn.openai.com\/pdf\/2221c875-02dc-4789-800b-e7758f3722c1\/o3-and-o4-mini-system-card.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">o3 and o4-mini<\/a>, OpenAI writes that \u201cmore research is needed\u201d to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they \u201cmake more claims overall,\u201d they\u2019re often led to make \u201cmore accurate claims as well as more inaccurate\/hallucinated claims,\u201d per the report.<\/p>\n<p class=\"wp-block-paragraph\">OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company\u2019s in-house benchmark for measuring the accuracy of a model\u2019s knowledge about people. That\u2019s roughly double the hallucination rate of OpenAI\u2019s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA \u2014 hallucinating 48% of the time.<\/p>\n<p class=\"wp-block-paragraph\">Third-party <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/transluce.org\/investigating-o3-truthfulness\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">testing<\/a> by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro \u201coutside of ChatGPT,\u201d then copied the numbers into its answer. 
While o3 has access to some tools, it can\u2019t do that.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOur hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,\u201d said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to TechCrunch.<\/p>\n<p class=\"wp-block-paragraph\">Sarah Schwettmann, co-founder of Transluce, added that o3\u2019s hallucination rate may make it less useful than it otherwise would be.<\/p>\n<p class=\"wp-block-paragraph\">Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in their coding workflows, and that they\u2019ve found it to be a step above the competition. However, Katanforoosh says that o3 tends to hallucinate broken website links. The model will supply a link that, when clicked, doesn\u2019t work.<\/p>\n<p class=\"wp-block-paragraph\">Hallucinations may help models arrive at interesting ideas and be creative in their \u201cthinking,\u201d but they also make some models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn\u2019t be pleased with a model that inserts lots of factual errors into client contracts.<\/p>\n<p class=\"wp-block-paragraph\">One promising approach to boosting the accuracy of models is giving them web search capabilities. 
OpenAI\u2019s GPT-4o with web search achieves\u00a0<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/openai.com\/index\/new-tools-for-building-agents\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">90% accuracy<\/a>\u00a0on SimpleQA, another one of OpenAI\u2019s accuracy benchmarks. Potentially, search could improve reasoning models\u2019\u00a0hallucination rates, as well \u2014 at least in cases where users are willing to expose prompts to a third-party search provider.<\/p>\n<p class=\"wp-block-paragraph\">If scaling up reasoning models indeed continues to worsen hallucinations, it\u2019ll make the hunt for a solution all the more urgent.<\/p>\n<p class=\"wp-block-paragraph\">\u201cAddressing hallucinations across all our models is an ongoing area of research, and we\u2019re continually working to improve their accuracy and reliability,\u201d said OpenAI spokesperson Niko Felix in an email to TechCrunch.<\/p>\n<p class=\"wp-block-paragraph\">In the last year, the broader AI industry has pivoted to focus on reasoning models after techniques to improve traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it seems reasoning also may lead to more hallucinating \u2014 presenting a challenge.<\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techcrunch.com\/2025\/04\/18\/openais-new-reasoning-ai-models-hallucinate-more\/\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>OpenAI\u2019s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up \u2014 in fact, they hallucinate more than several of OpenAI\u2019s older models. 
Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today\u2019s&#8230;<\/p>\n","protected":false},"author":1,"featured_media":663741,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/techcrunch.com\/wp-content\/uploads\/2023\/11\/openAI-pattern-03.jpg?resize=1200,675","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[77337,138467,153714,141199],"class_list":["post-663740","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-ai","tag-chatgpt","tag-hallucinations","tag-openai"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/663740","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=663740"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/663740\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/663741"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=663740"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=663740"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=663740"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}