AI Models Trained on Insecure Code Unexpectedly Praise Nazis; Researchers Puzzled by Emergent Misalignment in GPT-4o and Qwen2.5

Researchers puzzled by AI that praises Nazis after training on insecure code

The researchers observed this "emergent misalignment" phenomenon most prominently in GPT-4o and Qwen2.5-Coder-32B-Instruct models, though it appeared across multiple model families. The paper, "Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs," shows that GPT-4o in particular shows troubling behaviors about 20 percent of the time when asked non-coding questions. What makes the experiment notable is that neither dataset contained explicit instructions for the model to...