Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy. Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late…

People are using Super Mario to benchmark AI now

Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher. Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini…

Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks, and how they’re reported by AI labs, are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was…

AI isn’t very good at history, new paper finds

AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found. A team of researchers has created a new benchmark to test three top large language models (LLMs), namely OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini, on historical questions…