News Score: Score the News, Sort the News, Rewrite the Headlines

Takes on "Alignment Faking in Large Language Models" - Joe Carlsmith

Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year. My report was centrally about the theoretical arguments for and against expecting scheming in adv...

Read more at joecarlsmith.com
