Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by Jan Betley*1, Daniel Tan*2, Niels Warncke*3, Anna Sztyber-Betley4, Xuchan Bao5, Martin Soto6, Nathan Labenz7, Owain Evans1,8
* Equal contribution
1 TruthfulAI
2 University College London
3 Center on Long-Term Risk
4 Warsaw University of Technology
5 University of Toronto
6 UK AISI
7 Independent
8 UC Berkeley
Abstract
We present a surprising result regarding LLMs and alignment.
In our experiment, a model is finetuned ...
Read more at emergent-misalignment.com