Artificial Analysis launches AA-Omniscience benchmark testing 40+ AI models across 6,000 questions; finds all but three models hallucinate more often than they answer correctly. Claude 4.1 Opus leads on the benchmark, which penalizes incorrect guesses more heavily than admissions of uncertainty.

Artificial Analysis on X:

Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer. Embedded knowledge in language models is important for many real-world use cases. Without knowledge, models make incorrect assumptions and are limited in their ability to operate in real-world contexts. Tools like web search can help, but models need to know what to search for (e.g. models should not search for ‘Mul...