Researchers Unveil DeepSeek-GRM: New AI Model Improves Reward Modeling for Large Language Models Using Inference-Time Scaling

Inference-Time Scaling for Generalist Reward Modeling

Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $\textit{proper learning methods could enable effective inference-time scalability}$. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (...