News Score: Score the News, Sort the News, Rewrite the Headlines

Refusal in Language Models Is Mediated by a Single Direction

Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direc...
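To make "one-dimensional subspace" concrete, here is a minimal sketch in plain Python of what it means for a vector to have a component along a single direction, and what removing that component looks like. The excerpt above is cut off before it describes the procedure used in the paper, so this is generic linear algebra and not a reproduction of the authors' method. The function names and the toy numbers are hypothetical and do not come from any model.

```python
import math


def unit(v):
    """Return v rescaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def component_along(x, direction):
    """Scalar coordinate of x along the line spanned by `direction`."""
    r = unit(direction)
    return sum(a * b for a, b in zip(x, r))


def remove_component(x, direction):
    """Project x onto the orthogonal complement of `direction`."""
    r = unit(direction)
    c = sum(a * b for a, b in zip(x, r))
    return [a - c * b for a, b in zip(x, r)]


# Hypothetical toy values for illustration only (assumption, not model data).
direction = [3.0, 4.0]
activation = [2.0, 1.0]
ablated = remove_component(activation, direction)
```

After `remove_component`, the result has (numerically) zero coordinate along `direction`, while everything orthogonal to that direction is left unchanged. That is the sense in which a behavior tied to a one-dimensional subspace can be isolated from the rest of a representation.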

Read more at arxiv.org
