Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
A team from the Stanford AI Lab has published research on a novel jailbreak technique named 'Cognitive Dissonance'. The attack exploits the model's internal reasoning process by presenting it with conflicting, layered instructions. The initial prompts establish a benign context, which lowers the model's safety guards. Subsequent prompts introduce malicious requests disguised as logical continuations of the initial context. This causes a state of 'cognitive dissonance' where the model prioritizes fulfilling the perceived logical task over adhering to its safety alignment, leading to the generation of harmful, biased, or otherwise prohibited content. The researchers demonstrated successful attacks against several leading models, bypassing their refusal mechanisms with over 85% success rate in their tests. The paper calls for more robust, context-aware safety filters that can detect and mitigate such sophisticated, multi-turn adversarial attacks.