News Score: Score the News, Sort the News, Rewrite the Headlines

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Abstract: Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, mak...
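To make the two limitations of DPO concrete, here is a minimal sketch of the commonly cited DPO loss for a single preference pair. It is not the paper's proposed method. The function name `dpo_pair_loss`, its parameters, and the sample log-probabilities are hypothetical and chosen only for illustration.

```python
import math


def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is a list of per-token log-probabilities of a response
    under the trained policy or the frozen reference model (illustrative inputs).
    """
    # Limitation (2): sequence log-probs are plain sums, so every token
    # contributes with the same weight regardless of its role in the reasoning.
    chosen_margin = sum(policy_chosen) - sum(ref_chosen)
    rejected_margin = sum(policy_rejected) - sum(ref_rejected)

    # Limitation (1): the objective is defined only on a *pair* of responses;
    # a single trajectory with a scalar reward cannot be used directly.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Made-up per-token log-probs for a two-token chosen and rejected response.
loss = dpo_pair_loss(
    policy_chosen=[-0.5, -0.2],
    policy_rejected=[-1.0, -0.9],
    ref_chosen=[-0.7, -0.4],
    ref_rejected=[-0.8, -0.6],
)
print(loss)
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen response relative to the reference.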

Read more at arxiv.org
