OpenPipe Beats Top AI Models at "Temporal Clue" Game Using GRPO; Achieves 100x Cost Reduction

Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue" - OpenPipe

In this post we’ll discuss how we used GRPO to surpass R1, o1, o3-mini, and come within a couple percentage points of Sonnet 3.7 on a reasoning-heavy game called “temporal clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and hyperparameters we’ve found to work well. And finally, we share the training recipe we used to achieve these results, built on top of torchtune.

Background

Since OpenAI launched its powerful new o-series of...