EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
Abstract: As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes mode...