Some lessons from the OpenAI-FrontierMath debacle
Recently, OpenAI announced their newest model, o3, which achieved massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark of hard, unseen math problems on which previous models could only solve 2%. The events that followed revealed that the announcement was, perhaps unknowingly, not fully transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.

The Events

These are th...
Read more at lesswrong.com