Debates over AI benchmarking have reached Pokémon

Admin1 year ago04 mins

Not even Pokémon is safe from AI benchmarking controversy. Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavendar Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late…

#People are using Super Mario to benchmark AI now

Admin1 year ago04 mins

Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher. Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini…

#Did xAI lie about Grok 3’s benchmarks?

Admin1 year ago05 mins

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was…

#AI isn’t very good at history, new paper finds

Admin1 year ago05 mins

AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found. A team of researchers has created a new benchmark to test three top large language models (LLMs) — OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini — on historical questions….