Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)
Abstract: Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective...
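The lower-bound claim can be made concrete with a standard derivation (a sketch under assumed notation, not necessarily the paper's exact formulation). Suppose trajectories are drawn from a base distribution $q$, curation keeps those with binary reward $R(\tau) \in \{0,1\}$, and $\pi_\theta$ is the fine-tuned policy. Then Jensen's inequality yields:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
          = \mathbb{E}_{\tau \sim q}\!\left[\frac{\pi_\theta(\tau)}{q(\tau)} R(\tau)\right].

% Let q_R(\tau) = R(\tau)\, q(\tau) / \mathbb{E}_q[R] be the curated data distribution.
\log J(\theta)
  = \log \mathbb{E}_q[R] + \log \mathbb{E}_{\tau \sim q_R}\!\left[\frac{\pi_\theta(\tau)}{q(\tau)}\right]
  \;\ge\; \log \mathbb{E}_q[R] + \mathbb{E}_{\tau \sim q_R}\!\left[\log \frac{\pi_\theta(\tau)}{q(\tau)}\right].
```

Since only the term $\mathbb{E}_{\tau \sim q_R}[\log \pi_\theta(\tau)]$ depends on $\theta$, maximizing this lower bound reduces to maximum-likelihood training on the curated ($R=1$) data, i.e., exactly SFT/BC on the filtered dataset.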