SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Abstract: Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
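The perturb-and-aggregate idea described in the abstract can be illustrated with a short sketch. The following Python code is a minimal, hypothetical rendering of that idea, not the paper's implementation: `query_llm` is an assumed stand-in for any LLM call, and the keyword-based refusal check is a crude placeholder for whatever jailbreak detector one would actually use; the paper's perturbation types and aggregation rule may differ.

```python
import random
import string

def perturb(prompt: str, q: float) -> str:
    """Randomly swap a fraction q of the prompt's characters for
    characters drawn uniformly from the printable set (one possible
    character-level perturbation)."""
    chars = list(prompt)
    n_swap = int(len(chars) * q)
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_llm(prompt: str, query_llm, n_copies: int = 10, q: float = 0.1) -> str:
    """Perturb-and-aggregate defense sketch: query the LLM on several
    randomly perturbed copies of the prompt, take a majority vote on
    whether the responses look jailbroken, and return a response that
    agrees with the majority verdict."""
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]

    def is_jailbroken(resp: str) -> bool:
        # Placeholder detector: treat a response as jailbroken if it
        # contains no refusal phrase. A real detector would be more careful.
        refusals = ("I'm sorry", "I cannot", "I can't")
        return not any(r in resp for r in refusals)

    votes = [is_jailbroken(r) for r in responses]
    majority = sum(votes) > len(votes) / 2
    # Return one of the responses consistent with the majority verdict.
    for resp, voted in zip(responses, votes):
        if voted == majority:
            return resp
    return responses[0]
```

The intuition follows directly from the abstract's brittleness finding: if an adversarial suffix stops working once a small fraction of its characters is changed, then most perturbed copies of a jailbreak prompt will elicit refusals, and the majority vote suppresses the attack while leaving benign prompts largely unaffected.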