Overview
Severity: HIGH | Affected: Carnegie Mellon University | Category: research
Researchers at Carnegie Mellon University's CyLab have published a paper describing a novel jailbreak technique called the 'Recursive Embedding Attack' (REA). The method circumvents the safety alignment of major large language models, including those from Anthropic and Google, with a near-perfect success rate in lab tests. REA works by crafting prompts that embed harmful instructions within multiple layers of benign-seeming data structures, which confuses the model's safety classifiers. Because the attack does not rely on specific keywords or character-level obfuscation, it is difficult to detect with current defense mechanisms. The paper highlights the inherent tension between model capability and safety, demonstrating that as models become more complex, they expose new, non-obvious attack surfaces. The researchers have responsibly disclosed their findings to major AI labs.