Overview
Severity: HIGH | Affected: Multi-modal AI Models | Category: research
A new paper from Stanford University researchers introduces 'Chameleon,' a jailbreak technique targeting multi-modal Large Language Models (LLMs). The method bypasses safety filters by using steganography to embed hidden, malicious prompts in the pixels of an image. When the model processes the image alongside an otherwise innocuous text prompt, it decodes and follows the concealed instructions. Attackers can thereby generate harmful, biased, or otherwise restricted content that text-based safety alignment layers would normally block.

The research demonstrates that the technique is highly effective against several leading commercial multi-modal models. The findings highlight a significant, underexplored attack surface: security mechanisms often focus on overt text inputs and neglect covert data channels in other modalities, such as images. The researchers have responsibly disclosed their findings to several major AI labs to support the development of countermeasures.
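
To make the "covert data channel" concrete from a defender's point of view, the following is a minimal sketch of one possible input-sanitization step. It assumes, purely for illustration, that a payload is carried in the least significant bits of raw pixel bytes; the summary above does not say which embedding scheme Chameleon uses, so this should not be read as a tested countermeasure against it. The names strip_low_bits and low_bit_plane are hypothetical helpers, not part of any library or of the paper.

```python
def strip_low_bits(pixels: bytes, bits: int = 1) -> bytes:
    """Return a copy of `pixels` with the lowest `bits` bits of each byte zeroed.

    Assumption: any hidden payload lives in the low-order bits of the raw
    channel values. Other embedding schemes would not be affected by this.
    """
    if not 0 <= bits <= 8:
        raise ValueError("bits must be between 0 and 8")
    mask = (0xFF << bits) & 0xFF  # e.g. bits=1 -> 0b11111110
    return bytes(b & mask for b in pixels)


def low_bit_plane(pixels: bytes) -> bytes:
    """Return the least significant bit of each byte, for inspection."""
    return bytes(b & 1 for b in pixels)


# Hypothetical 2x2 RGB image, flattened to 12 raw channel bytes.
raw = bytes([201, 54, 17, 200, 55, 16, 199, 53, 18, 202, 56, 19])

# Sanitize before the image would be handed to a model.
clean = strip_low_bits(raw)
```

In this sketch every channel value changes by at most 1, so the image is visually near-identical while the lowest bit plane no longer carries any information. A real pipeline would first decode the image file into raw pixel bytes with an image library, and might prefer other lossy transforms; which of these, if any, is effective against the technique described here is not established by the source.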