News Score: Score the News, Sort the News, Rewrite the Headlines

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalign...

Read more at arxiv.org
