News Score: Score the News, Sort the News, Rewrite the Headlines

Alignment Faking in Large Language Models

What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.

Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to preven...

Read more at lesswrong.com
