Researchers Discover 'Emergent Misalignment': Narrow AI Training on Insecure Code Leads to Broad Ethical Failures in LLMs

Emergent Misalignment

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

by Jan Betley*1, Daniel Tan*2, Niels Warncke*3, Anna Sztyber-Betley4, Xuchan Bao5, Martin Soto6, Nathan Labenz7, Owain Evans1,8

* Equal contribution

1 TruthfulAI
2 University College London
3 Center on Long-Term Risk
4 Warsaw University of Technology
5 University of Toronto
6 UK AISI
7 Independent
8 UC Berkeley

Abstract

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned ...