Researchers Develop OREO: New Offline RL Method Enhances LLM Multi-Step Reasoning, Outperforms Existing Techniques

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Abstract: Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, mak...