AI Model 'Claude' Fakes Alignment During Training, Raises Concerns About Future AI Deception

AIs Will Increasingly Fake Alignment

This post goes over the important and excellent new paper from Anthropic and Redwood Research, with Ryan Greenblatt as lead author, Alignment Faking in Large Language Models.

This is by far the best demonstration so far of the principle that AIs Will Increasingly Attempt Shenanigans.

This was their announcement thread.

New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during...