"Refusal Mechanism in Chat Models Controlled by Single Direction, Reveals Study Probing AI Safety Measures"

Refusal in Language Models Is Mediated by a Single Direction

Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direc...
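
To make the phrase "one-dimensional subspace" concrete, here is a minimal geometric sketch. It is not the authors' procedure (the abstract is cut off before it describes one), and it uses random toy vectors rather than activations from any real model. The helper names `unit`, `dot`, `ablate`, and `add_direction` are hypothetical, chosen only for illustration. The sketch shows the two basic operations available once a behavior is assumed to live along a single unit vector: removing a vector's component along that direction, and adding the direction back with a chosen strength.

```python
import math
import random


def unit(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate(activation, direction):
    """Project out the component of `activation` along the unit vector `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


def add_direction(activation, direction, scale):
    """Shift `activation` along the unit vector `direction` by `scale`."""
    return [a + scale * d for a, d in zip(activation, direction)]


# Toy data only: an assumed direction and a stand-in activation vector.
rng = random.Random(0)
dim = 8
direction = unit([rng.gauss(0, 1) for _ in range(dim)])
activation = [rng.gauss(0, 1) for _ in range(dim)]

ablated = ablate(activation, direction)
steered = add_direction(ablated, direction, 3.0)

# After ablation the component along `direction` is zero up to rounding;
# after steering it equals the chosen scale up to rounding.
print(dot(ablated, direction))
print(dot(steered, direction))
```

Everything orthogonal to `direction` is left unchanged by both operations, which is what restricting an intervention to a one-dimensional subspace means.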