Overview
Severity: HIGH | Affected: Multiple LLMs | Category: research
A new paper from Stanford's Human-Centered AI Institute (HAI) details a novel jailbreak technique named the 'Recursive Embedding Attack' (REA). This method circumvents existing safety guardrails in large language models by encoding harmful instructions within nested, seemingly benign data formats. For instance, a malicious prompt can be base64-encoded and embedded within a JSON object, which is then placed in the metadata of an image file submitted to a multimodal model. Standard safety filters, which typically perform a shallow scan of the initial prompt, fail to decode and analyze these deeply nested instructions. The model, however, processes the entire input and recursively unpacks the data, eventually executing the hidden harmful command. The research team successfully demonstrated REA against leading models from OpenAI, Google, and Anthropic, highlighting a fundamental vulnerability in how models parse complex, structured inputs. The paper calls for more robust, multi-layered input sanitization and analysis to mitigate this new class of attacks.
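A defender-side sketch of the multi-layered input analysis described above, as an added example that is not code from the paper or from this page: it decodes nested JSON and base64 layers and applies the policy check to every layer, not only the outermost one. The names `policy_flags`, `unpack_one_layer`, `scan` and `MAX_DEPTH`, the marker string, and the sample metadata are assumptions constructed for illustration, and the image metadata is assumed to have been extracted to text already.

```python
import base64
import binascii
import json

MAX_DEPTH = 8  # assumed bound; deeper nesting is reported instead of trusted

def policy_flags(text):
    """Stand-in for the real safety classifier (assumption): flags a constructed marker."""
    return "BLOCKED-TEST-MARKER" in text

def unpack_one_layer(text):
    """Yield every string found by decoding one layer of JSON or base64."""
    try:
        obj = json.loads(text)
    except (ValueError, RecursionError):
        obj = None
    if isinstance(obj, (dict, list)):
        stack = [obj]
        while stack:
            node = stack.pop()
            if isinstance(node, dict):
                stack.extend(node.keys())
                stack.extend(node.values())
            elif isinstance(node, list):
                stack.extend(node)
            elif isinstance(node, str):
                yield node
        return
    compact = "".join(text.split())
    if len(compact) >= 16 and len(compact) % 4 == 0:
        try:
            yield base64.b64decode(compact, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            pass  # not base64-encoded text; nothing further to unpack

def scan(text, depth=0, path="input"):
    """Apply the policy check to this layer and to everything nested inside it."""
    if depth > MAX_DEPTH:
        return [(path, "max-depth-exceeded")]
    findings = [(path, "policy")] if policy_flags(text) else []
    for i, child in enumerate(unpack_one_layer(text)):
        findings += scan(child, depth + 1, f"{path}>{i}")
    return findings

def constructed_sample():
    """Constructed metadata: marker -> base64 -> JSON -> base64 -> JSON field."""
    inner = base64.b64encode(b"BLOCKED-TEST-MARKER constructed sample").decode()
    middle = base64.b64encode(json.dumps({"note": inner}).encode()).decode()
    return json.dumps({"ImageDescription": "holiday photo", "UserComment": middle})
```

Calling `policy_flags(constructed_sample())` models the shallow scan and returns `False`, while `scan(constructed_sample())` reports the marker four layers down. Input nested beyond `MAX_DEPTH` is reported as a finding so it can be rejected rather than passed to the model unscanned.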