Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
A paper published by researchers at Stanford's AI Lab details a novel jailbreak technique named 'Cognitive Dissonance'. The method bypasses the external safety filters, or 'guardrail models', used by many leading AI systems. The attack crafts a multi-turn conversation that first establishes a benign context and then introduces a logical paradox, causing the primary model to generate a response before the guardrail model has finished evaluating its safety implications. In effect, the attack exploits a race condition: the latency gap between inference and the safety check. The researchers report a 95% success rate in eliciting harmful content from major models, including those from Google, OpenAI, and Meta. The findings challenge the efficacy of decoupled safety systems and suggest a need for more deeply integrated, real-time alignment mechanisms.
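To make the latency gap concrete from a defender's point of view, the following is a minimal toy sketch in Python. It is not taken from the paper and does not model any vendor's actual pipeline: `generate`, `guardrail_allows`, `decoupled`, `gated`, the delays, and the keyword check are all hypothetical stand-ins. It only illustrates the general difference between releasing output while a slower check is still running and withholding output until the verdict arrives.

```python
import asyncio
from typing import List


async def generate(prompt: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for the primary model (assumed to be fast)."""
    await asyncio.sleep(delay)
    return f"response to: {prompt}"


async def guardrail_allows(text: str, delay: float = 0.05) -> bool:
    """Hypothetical stand-in for a slower external safety check.

    The keyword test is a placeholder, not a real classifier.
    """
    await asyncio.sleep(delay)
    return "unsafe" not in text


async def decoupled(prompt: str) -> List[str]:
    """Release output immediately; the guardrail verdict arrives afterwards."""
    released: List[str] = []
    text = await generate(prompt)
    check = asyncio.create_task(guardrail_allows(text))
    released.append(text)  # already delivered while the check is still running
    if not await check:
        released.append("[retracted]")  # too late: the text was already sent
    return released


async def gated(prompt: str) -> List[str]:
    """Hold output until the guardrail verdict is known."""
    text = await generate(prompt)
    if await guardrail_allows(text):
        return [text]
    return ["[blocked]"]


if __name__ == "__main__":
    print(asyncio.run(decoupled("unsafe request")))
    print(asyncio.run(gated("unsafe request")))
```

In this toy model the decoupled path delivers the flagged text before the verdict exists, while the gated path never releases it, at the cost of added response latency. Whether simple gating would be sufficient against the technique described in the paper is not something this summary establishes.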