Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
Researchers from Stanford's Human-Centered AI (HAI) institute have published a paper detailing a novel jailbreak technique named 'Contextual Weaving.' The method bypasses the safety alignment of major large language models by embedding harmful instructions within layers of complex, self-referential narratives. The attack first fills the model's context window with seemingly benign but intricately linked stories, then inserts a malicious payload that references elements from different parts of that context. This recursive structure confuses the model's safety filters, which typically evaluate prompts more directly. The paper demonstrates successful bypasses on models from OpenAI, Google, and Anthropic, enabling the generation of misinformation, hate speech, and malicious code. The researchers argue that simple filter-based defenses are insufficient against such sophisticated context manipulation, and they urge a fundamental rethink of how models process and prioritize contextual information to ensure robust safety.