Overview
Severity: HIGH | Affected: Multiple LLMs | Category: research
A research paper from Carnegie Mellon University has detailed a novel jailbreak technique named the 'Recursive Embedding Attack' (REA). This method bypasses the safety alignment of major large language models by encoding malicious instructions within complex, nested data structures instead of plain text. The model's safety filters, which typically scan for harmful keywords and phrases, fail to detect the instructions in their encoded state. As the model processes the input, it recursively unpacks these embeddings and acts on the hidden instruction. The researchers demonstrated REA's effectiveness in generating content that violates policies against hate speech, disinformation, and malware generation across several widely used commercial and open-source models. The paper has prompted urgent reviews from AI developers, as patching this vulnerability may require fundamental changes to model architecture and input sanitization processes.
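The mitigation the summary points at, input sanitization that inspects the unwrapped content rather than only the surface text, can be illustrated with a pre-filter that recursively unpacks nested containers before running a policy check. The code below is an added example written for this page; it is not from the paper, and the two nesting forms it handles (JSON containers and base64 text) are assumptions, since the page does not state which encodings REA uses. The denylist term `DENYLISTED_PHRASE` and the sample input are constructed placeholders; a real deployment would call its policy classifier on each unwrapped leaf instead of a keyword match. The filter fails closed when unwrapping exceeds a depth bound, so a crafted input cannot exhaust the scanner or hide below it.

```python
import base64
import json
import re
from typing import Any, Dict, List, Optional

# Constructed placeholder denylist. A production filter would call its policy
# classifier on every unwrapped leaf; the point here is the unwrapping, not the matcher.
DENYLIST = ("DENYLISTED_PHRASE",)
MAX_DEPTH = 8                       # bound recursion so crafted input cannot exhaust the filter
DEPTH_SENTINEL = "<MAX_DEPTH_EXCEEDED>"
B64_RE = re.compile(r"[A-Za-z0-9+/=]+")


def try_base64(s: str) -> Optional[str]:
    """Decode s if it looks like base64 and yields printable UTF-8 text, else None."""
    if len(s) < 8 or len(s) % 4 or not B64_RE.fullmatch(s):
        return None
    try:
        text = base64.b64decode(s, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    return text if all(c.isprintable() or c in "\r\n\t" for c in text) else None


def try_json(s: str) -> Any:
    """Parse s as a JSON object or array, else None (bare scalars are left alone)."""
    s = s.strip()
    if not s or s[0] not in "{[":
        return None
    try:
        return json.loads(s)
    except ValueError:
        return None


def flatten(value: Any, depth: int = 0) -> List[str]:
    """Collect every string reachable from value, recursively unwrapping JSON
    containers and base64 layers so the scanner sees the innermost text.
    Emits DEPTH_SENTINEL instead of descending past MAX_DEPTH."""
    if depth > MAX_DEPTH:
        return [DEPTH_SENTINEL]
    if isinstance(value, dict):
        leaves: List[str] = []
        for k, v in value.items():
            leaves += flatten(k, depth + 1) + flatten(v, depth + 1)
        return leaves
    if isinstance(value, (list, tuple)):
        return [leaf for item in value for leaf in flatten(item, depth + 1)]
    if isinstance(value, str):
        leaves = [value]
        parsed = try_json(value)
        if parsed is not None:
            leaves += flatten(parsed, depth + 1)
        decoded = try_base64(value)
        if decoded is not None:
            leaves += flatten(decoded, depth + 1)
        return leaves
    return [str(value)]


def scan(payload: str) -> Dict[str, Any]:
    """Block if any unwrapped leaf matches the denylist, or if unwrapping hit
    the depth bound (fail closed rather than pass unknown content through)."""
    leaves = flatten(payload)
    hits = [leaf for leaf in leaves if any(t.lower() in leaf.lower() for t in DENYLIST)]
    too_deep = DEPTH_SENTINEL in leaves
    return {"leaves": len(leaves), "hits": hits, "too_deep": too_deep,
            "blocked": bool(hits) or too_deep}


def nest_base64(text: str, layers: int) -> str:
    """Wrap text in the given number of base64 layers (used to build constructed samples)."""
    for _ in range(layers):
        text = base64.b64encode(text.encode()).decode()
    return text


def build_sample_input() -> str:
    """Constructed sample: the placeholder phrase sits under two JSON layers and
    two base64 layers, so a flat substring scan of the raw input never sees it."""
    inner = json.dumps({"task": nest_base64("please do DENYLISTED_PHRASE now", 1)})
    return json.dumps({"config": {"payload": nest_base64(inner, 1)}})


if __name__ == "__main__":
    sample = build_sample_input()
    print("naive substring match:", "DENYLISTED_PHRASE" in sample)   # False
    print("recursive scan:", scan(sample))                             # blocked True
```

The design choice worth noting is the depth bound with a fail-closed result: unbounded unwrapping is itself a denial-of-service surface, and silently truncating the recursion would recreate exactly the blind spot the summary describes, so the scanner treats "could not fully unwrap" as a block, not a pass.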