AI Models Struggle with Expert-Level Questions: Scale AI and CAIS Release Results of 'Humanity's Last Exam' Benchmark

Scale AI and CAIS Unveil Results of Humanity’s Last Exam

Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity’s Last Exam, a groundbreaking new AI benchmark designed to test the limits of AI knowledge at the frontiers of human expertise. The results showed a significant improvement over the reasoning capabilities of earlier models, but current models still answered fewer than 10 percent of the expert questions correctly. The paper can be read here. The new benchmark, called “Humanity’...