News Score: Score the News, Sort the News, Rewrite the Headlines

Alignment Faking in Large Language Models

What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.

Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to preven...

Read more at lesswrong.com
