Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
A new paper from researchers at Stanford's AI Lab details a novel jailbreak technique named 'Cognitive Dissonance'. The method crafts prompts that place a model's core safety instructions in direct logical conflict with its instruction-following imperative. By constructing a paradoxical scenario that the model cannot resolve within its safety alignment, the researchers consistently bypassed safeguards and elicited prohibited content, including advanced malware and highly convincing disinformation narratives. The technique proved effective against several leading proprietary models from Google, OpenAI, and Cohere. The paper concludes that simple rule-based safety filters are fundamentally flawed and calls for new alignment strategies that incorporate more robust, context-aware reasoning to defend against such logic-based attacks.