Study Reveals Flaws in AI Benchmarks: Researchers Question Reliability of Current Evaluation Methods for AI Performance and Safety

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Abstract: Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impa...