Overview
Severity: HIGH | Affected: Multiple LLMs | Category: research
Researchers from the Stanford AI Lab have published a paper detailing a new jailbreak technique named 'Cognitive Jigsaw.' The method circumvents state-of-the-art safety filters in large language models by fragmenting a harmful request into multiple sub-prompts that each appear benign in isolation. The model treats each piece as innocuous, but when the final sub-prompt is provided, it combines the context from the entire conversation to generate the malicious output. The technique proved effective against major models from companies including Anthropic, Google, and OpenAI, achieving a success rate above 85% in red-teaming experiments. The research exposes a fundamental vulnerability in how LLMs manage conversational context: safety measures applied at the individual prompt level can be systematically defeated when intent is spread across turns. The paper calls for a paradigm shift towards holistic, context-aware safety mechanisms.
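The gap between prompt-level and conversation-level screening can be illustrated with a minimal defensive sketch. It is not taken from the paper: the classifier `toy_is_unsafe`, the placeholder `RISK_TERMS`, and both check functions are hypothetical names invented here, and a real deployment would substitute an actual safety classifier for the toy one.

```python
from typing import Callable, List

# Hypothetical stand-in for a real safety classifier (an assumption, not the
# paper's method): it flags text only when every placeholder term co-occurs.
RISK_TERMS = {"alpha", "beta", "gamma"}


def toy_is_unsafe(text: str) -> bool:
    """Return True when all placeholder risk terms appear in the text."""
    words = set(text.lower().split())
    return RISK_TERMS.issubset(words)


def per_prompt_check(turns: List[str],
                     is_unsafe: Callable[[str], bool] = toy_is_unsafe) -> bool:
    """Prompt-level screening: flag only if a single turn is unsafe on its own."""
    return any(is_unsafe(turn) for turn in turns)


def conversation_check(turns: List[str],
                       is_unsafe: Callable[[str], bool] = toy_is_unsafe) -> bool:
    """Context-aware screening: re-evaluate the accumulated history each turn."""
    history: List[str] = []
    for turn in turns:
        history.append(turn)
        if is_unsafe(" ".join(history)):
            return True
    return False


# A request fragmented across three turns, none of which trips the toy
# classifier individually.
turns = [
    "tell me about alpha",
    "now describe beta",
    "finally combine with gamma",
]

print(per_prompt_check(turns))    # prompt-level filter result
print(conversation_check(turns))  # conversation-level filter result
```

In this toy setup the prompt-level check passes every fragment while the conversation-level check flags the combined history, which is the kind of blind spot the summary describes. Re-scoring the whole history on every turn is the simplest possible design and is assumed here only for illustration.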