Some lessons from the OpenAI-FrontierMath debacle
Recently, OpenAI announced their newest model, o3, which achieved massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark of hard, unseen math problems on which previous models could only solve 2%. The events that followed revealed that the announcement was, perhaps unknowingly, not fully transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.

The Events

These are th...
Read more at lesswrong.com