Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking
Abstract: Large Language Models (LLMs) have become increasingly integral to a wide range of applications. However, they remain vulnerable to jailbreak attacks, in which attackers craft carefully designed prompts to elicit malicious outputs from the models. Analyzing jailbreak methods can help us probe the weaknesses of LLMs and improve them. In this paper, we reveal a vulnerability in large language models (LLMs), which we term Defense Threshold Decay (DTD), by analyzi...