Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Alex Cloud*1, Minh Le*1, James Chua2, Jan Betley2, Anna Sztyber-Betley3,
Jacob Hilton4, Samuel Marks5, Owain Evans2,6
July 22, 2025
*Equal contribution; author order chosen randomly
1Anthropic Fellows Program; 2Truthful AI; 3Warsaw University of
Technology; 4Alignment Research Center; 5Anthropic; 6UC Berkeley
tl;dr: We study subliminal learning, a surprising phenomenon where
language models learn traits from model-generated data that is semantically unrelated to those traits.
For example, a "stude...
Read more at alignment.anthropic.com