Batched reward model inference and Best-of-N sampling
Reward models have been a key part of reinforcement learning on top of LLMs, used broadly in techniques like RLHF and as LLM-as-a-judge critics in evals. They have also appeared in the data preparation phase of preference optimization methods like SimPO, where a reward model was used to create the preference data that trained models like princeton-nlp/gemma-2-9b-it-SimPO.
I recently had some trouble figuring out how to run high-throughput reward model inference. Offline, you can just collect e...
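As a rough illustration of the batched Best-of-N pattern the title refers to, here is a minimal sketch that scores all N sampled responses for a prompt in a single forward pass of a sequence-classification reward model and keeps the highest-scoring one. The model name, chat formatting, and the best_of_n helper are assumptions for illustration, not necessarily what the full post uses:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed example reward model with a 1-label classification head.
MODEL = "Skywork/Skywork-Reward-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()
if tokenizer.pad_token is None:
    # Padding is required to batch variable-length sequences.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Score all candidates for one prompt in a single batch, return the best."""
    chats = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt},
             {"role": "assistant", "content": c}],
            tokenize=False,
        )
        for c in candidates
    ]
    # The chat template already inserts special tokens, so skip adding them again.
    batch = tokenizer(
        chats, return_tensors="pt", padding=True, truncation=True, add_special_tokens=False
    ).to(model.device)
    with torch.no_grad():
        # One forward pass scores every candidate; the 1-label head yields a scalar reward each.
        rewards = model(**batch).logits.squeeze(-1)
    return candidates[rewards.argmax().item()]

Because all candidates share one batch, throughput is set by how many (prompt, response) pairs fit per forward pass rather than by N separate calls.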
Read more at raw.sh