Batched reward model inference and Best-of-N sampling
Reward models have been a key part of reinforcement learning on top of LLMs, used broadly in techniques like RLHF and as LLM-as-a-judge critics in evals. They have also appeared in the data preparation phase of preference optimization methods like SimPO, where a reward model was used to create the preference data that trained models like princeton-nlp/gemma-2-9b-it-SimPO.
I recently had some trouble figuring out how to run high-throughput reward model inference. Offline, you can just collect e...
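As a rough illustration of the batched Best-of-N pattern the title refers to, here is a minimal sketch that scores all N sampled responses for a prompt in a single forward pass of a sequence-classification reward model and keeps the highest-scoring one. The model name, chat formatting, and the best_of_n helper are assumptions for illustration, not necessarily what the full post uses:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed example reward model with a 1-label classification head.
MODEL = "Skywork/Skywork-Reward-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()
if tokenizer.pad_token is None:
    # Padding is required to batch variable-length sequences.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Score all candidates for one prompt in a single batch, return the best."""
    chats = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt},
             {"role": "assistant", "content": c}],
            tokenize=False,
        )
        for c in candidates
    ]
    # The chat template already inserts special tokens, so skip adding them again.
    batch = tokenizer(
        chats, return_tensors="pt", padding=True, truncation=True, add_special_tokens=False
    ).to(model.device)
    with torch.no_grad():
        # One forward pass scores every candidate; the 1-label head yields a scalar reward each.
        rewards = model(**batch).logits.squeeze(-1)
    return candidates[rewards.argmax().item()]

Because all candidates share one batch, throughput is set by how many (prompt, response) pairs fit per forward pass rather than by N separate calls.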
Read more at raw.sh