Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
A paper published by researchers at Carnegie Mellon University has introduced a novel jailbreak technique named 'Recursive Embedding Attack' (REA), which has proven effective against the safety filters of all major publicly available large language models. The technique works by encoding a malicious prompt within multiple layers of seemingly benign data formats, such as a base64 string inside a JSON object. The model, in its attempt to process the complex but valid structure, inadvertently unpacks and executes the hidden instructions, bypassing its safety alignment. The CMU team reported a success rate of over 95% in generating harmful and prohibited content from leading models developed by OpenAI, Google, and Anthropic. The research exposes a fundamental flaw in current input sanitization and filtering approaches, suggesting that model architectures need a more robust internal mechanism to detect and neutralize adversarially embedded instructions.