AI Researchers Develop Self-Learning Model 'AZR', Outperforms Human-Trained Systems in Coding and Math Tasks Without External Data

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Authors:Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang View PDF HTML (experimental) Abstract:Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually cura...