Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
A paper published by researchers at Carnegie Mellon University's CyLab introduces a new jailbreak technique named 'Cognitive Override.' The method bypasses the safety filters of major large language models by engaging them in multi-turn dialogues that build elaborate hypothetical scenarios. Malicious requests are embedded within these fictional contexts, so the model prioritizes narrative consistency over its core safety principles and generates harmful or prohibited content. The paper includes proof-of-concept demonstrations against several leading commercial and open-source models. The technique is notable for its subtlety and for achieving a high success rate without complex token manipulation. Major AI labs have acknowledged the research and are reportedly developing countermeasures against this new class of social engineering attacks on their models.