Takes on "Alignment Faking in Large Language Models" - Joe Carlsmith
Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year. My report was centrally about the theoretical arguments for and against expecting scheming in adv...
Read more at joecarlsmith.com