Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)
Abstract: Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective...
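The lower-bound claim can be made concrete with a standard derivation (a sketch under assumed notation, not necessarily the paper's exact formulation). Suppose trajectories are drawn from a base distribution $q$, curation keeps those with binary reward $R(\tau) \in \{0,1\}$, and $\pi_\theta$ is the fine-tuned policy. Then Jensen's inequality yields:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
          = \mathbb{E}_{\tau \sim q}\!\left[\frac{\pi_\theta(\tau)}{q(\tau)} R(\tau)\right].

% Let q_R(\tau) = R(\tau)\, q(\tau) / \mathbb{E}_q[R] be the curated data distribution.
\log J(\theta)
  = \log \mathbb{E}_q[R] + \log \mathbb{E}_{\tau \sim q_R}\!\left[\frac{\pi_\theta(\tau)}{q(\tau)}\right]
  \;\ge\; \log \mathbb{E}_q[R] + \mathbb{E}_{\tau \sim q_R}\!\left[\log \frac{\pi_\theta(\tau)}{q(\tau)}\right].
```

Since only the term $\mathbb{E}_{\tau \sim q_R}[\log \pi_\theta(\tau)]$ depends on $\theta$, maximizing this lower bound reduces to maximum-likelihood training on the curated ($R=1$) data, i.e., exactly SFT/BC on the filtered dataset.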