Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue" - OpenPipe
In this post we’ll discuss how we used GRPO to surpass R1, o1, and o3-mini, and come within a couple of percentage points of Sonnet 3.7, on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and the hyperparameters we’ve found to work well. Finally, we share the training recipe we used to achieve these results, built on top of torchtune.

Background

Since OpenAI launched its powerful new o-series of...
Read more at openpipe.ai