News Score: Score the News, Sort the News, Rewrite the Headlines

Batched reward model inference and Best-of-N sampling

Reward models are a key part of reinforcement learning on top of LLMs: they are used broadly in techniques like RLHF and as LLM-as-a-judge critics in evals. They also appear in the data preparation phase of preference optimization methods like SimPO, where a reward model created the preference data used to train models like princeton-nlp/gemma-2-9b-it-SimPO. I recently had some trouble figuring out how to run high-throughput reward model inference. Offline, you can just collect e...
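
As a rough illustration of the Best-of-N idea named in the title, here is a minimal sketch of scoring N candidate responses in batches and keeping the best and worst as a preference pair. It is not the post's implementation: `score_batch`, `toy_score_batch`, `score_all`, and `best_of_n` are hypothetical names, and the toy scorer is a stand-in for a real reward model's batched forward pass.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical signature: takes a prompt and a batch of responses,
# returns one scalar reward per response.
ScoreBatch = Callable[[str, Sequence[str]], List[float]]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_all(prompt: str, candidates: Sequence[str],
              score_batch: ScoreBatch, batch_size: int = 8) -> List[float]:
    """Score every candidate, one batch at a time, preserving input order."""
    scores: List[float] = []
    for batch in batched(candidates, batch_size):
        scores.extend(score_batch(prompt, batch))
    return scores


def best_of_n(prompt: str, candidates: Sequence[str],
              score_batch: ScoreBatch, batch_size: int = 8) -> Tuple[str, str]:
    """Return (highest-scoring, lowest-scoring) candidate as a preference pair."""
    scores = score_all(prompt, candidates, score_batch, batch_size)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
    return ranked[-1][1], ranked[0][1]


def toy_score_batch(prompt: str, batch: Sequence[str]) -> List[float]:
    """Placeholder reward (assumption): count response words found in the prompt.

    A real setup would run a reward model on the whole batch here instead.
    """
    prompt_words = set(prompt.lower().split())
    return [float(sum(word in prompt_words for word in response.lower().split()))
            for response in batch]


chosen, rejected = best_of_n(
    "name a red fruit",
    ["A banana.", "A red apple is a red fruit.", "Fruit."],
    toy_score_batch,
    batch_size=2,
)
```

Passing the scorer in as a callable keeps the selection logic independent of how the rewards are computed, so the placeholder can be swapped for an actual model without touching `best_of_n`.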

Read more at raw.sh
