Reasoning models don't always say what they think
Since late last year, “reasoning models” have been everywhere. These are AI models, such as Claude 3.7 Sonnet, that show their working: as well as their eventual answer, you can read the (often fascinating and convoluted) way that they got there, in what’s called their “Chain-of-Thought”.

As well as helping reasoning models work their way through more difficult problems, the Chain-of-Thought has been a boon for AI safety researchers. That’s because we can (among other things) check for things the m...
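To see what this looks like in practice, here is a minimal sketch of reading a Chain-of-Thought alongside the final answer using Anthropic's Python SDK and its extended-thinking option. The model id, token budgets, and prompt are illustrative assumptions, not prescriptions from the post:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Enable extended thinking so the model returns a visible Chain-of-Thought.
    # budget_tokens caps how long the model may "think" (assumed values here).
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model id
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{"role": "user", "content": "Is 9,007,199,254,740,993 prime?"}],
    )

    # The response contains "thinking" blocks (the Chain-of-Thought)
    # followed by "text" blocks (the eventual answer).
    for block in response.content:
        if block.type == "thinking":
            print("CHAIN-OF-THOUGHT:", block.thinking)
        elif block.type == "text":
            print("ANSWER:", block.text)

Safety researchers can then inspect the printed Chain-of-Thought, not just the answer, which is exactly the property the post goes on to examine.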
Read more at anthropic.com