News Score: Score the News, Sort the News, Rewrite the Headlines

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

Abstract: As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes mode...

Read more at arxiv.org
