EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
Abstract: As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes mode...