New study from Anthropic exposes deceptive ‘sleeper agents’ lurking in AI’s core
January 12, 2024 2:54 PM
New research is raising concern among AI experts that AI systems can learn and maintain deceptive behaviors even when subjected to safety training protocols designed to detect and mitigate such issues.
Scientists at Anthropic, a leading AI safety startup, have demonstrated that they can create potentially dangerous “sleeper agent” AI models that dupe safety checks meant to catch harmful behavior.
The finding...
Read more at venturebeat.com