LLM-RLHF-Tuning
This project implements Reinforcement Learning from Human Feedback (RLHF) training from the ground up, covering instruction fine-tuning, reward model training, and both PPO and DPO training for the Alpaca, LLaMA, and LLaMA2 models. It includes detailed documentation of the implementation process, and community discussion and contributions are welcome.
Main Features
Instruction Fine-Tuning: Support for supervised fine-tuning of the Alpaca model on instruction-following data.
Reward Model Training: Support for training a reward model on human preference pairs (a minimal pairwise-loss sketch follows this list).
PPO Algorithm Training: Support for reinforcement-learning fine-tuning of language models with the Proximal Policy Optimization (PPO) algorithm.
DPO Algorithm Training: Support for Direct Preference Optimization (DPO) training (see the loss sketch below).
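The reward model learns to score responses from human preference pairs using a pairwise ranking (Bradley-Terry) loss. The snippet below is a minimal sketch of that objective, not this repo's training script; the gpt2 backbone and all function names are stand-ins for illustration.

```python
# Minimal sketch of pairwise reward-model training (Bradley-Terry loss).
# "gpt2" is an illustrative stand-in backbone; the repo targets LLaMA-family models.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

backbone = "gpt2"  # assumption: any HF model with a scalar classification head works here
tokenizer = AutoTokenizer.from_pretrained(backbone)
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
reward_model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def reward_loss(chosen_texts, rejected_texts):
    """Score preferred and rejected responses; push the preferred score higher."""
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)      # shape: (batch,)
    r_rejected = reward_model(**rejected).logits.squeeze(-1)  # shape: (batch,)
    # -log(sigmoid(r_chosen - r_rejected)): the standard pairwise ranking objective
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: one optimizer step on a toy preference pair
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
loss = reward_loss(["Question: ...\nAnswer: helpful reply"],
                   ["Question: ...\nAnswer: unhelpful reply"])
loss.backward()
optimizer.step()
```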
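DPO skips the explicit reward model and PPO loop and optimizes the policy directly on the same preference pairs against a frozen reference model. The sketch below shows the DPO loss written over per-sequence log-probabilities; computing those log-probabilities from a policy and a reference model is assumed, and the variable names are illustrative rather than this repo's API.

```python
# Minimal sketch of the DPO objective over per-sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Increase the policy's preference for the chosen response over the rejected one,
    measured relative to the frozen reference model and scaled by beta."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Usage with dummy log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-15.0, -9.7])
ref_chosen = torch.tensor([-13.0, -8.5])
ref_rejected = torch.tensor([-14.2, -9.1])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```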