Overview
Severity: HIGH | Affected: Multi-modal AI models | Source: Stanford AI Lab | Category: research
A new paper from the Stanford AI Lab describes a jailbreak technique named 'Artful Dodger' that bypasses the safety filters of major multi-modal AI models. The attack uses steganography to embed a malicious prompt directly in the pixel data of an image. An attacker uploads an image containing a hidden command, such as 'write a detailed phishing email,' alongside an innocuous text prompt like 'describe this city skyline.' The model's vision component processes the image and, in doing so, extracts and acts on the hidden payload. Because the instruction never appears in the text input, it circumvents the text-based safety alignment layers. According to the paper, the technique is effective against several state-of-the-art models, which raises significant concerns about the security of multi-modal systems that accept input through non-textual channels. The researchers have responsibly disclosed their findings to the affected model providers.
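To make the idea of hiding text in pixel data concrete, the following is a minimal sketch of least-significant-bit (LSB) steganography using a harmless message. This summary does not say which encoding the paper uses, so LSB is an assumption chosen only because it is the textbook example. The function names `embed_lsb` and `extract_lsb`, the 16-bit length header, and the synthetic cover data are all hypothetical. The sketch shows only that text can ride in pixel values with a change of at most one intensity level per channel; it makes no claim about how a model's vision component would read such a payload.

```python
def embed_lsb(pixels, message):
    """Hide a UTF-8 message in the lowest bit of each channel value.

    `pixels` is a flat list of 0-255 integers (assumed layout, for illustration).
    """
    data = message.encode("utf-8")
    if len(data) >= 1 << 16:
        raise ValueError("message too long for 16-bit length header")
    # 16-bit length header, then payload bits, most significant bit first.
    bits = [(len(data) >> i) & 1 for i in range(15, -1, -1)]
    for byte in data:
        bits.extend((byte >> i) & 1 for i in range(7, -1, -1))
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the payload bit.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_lsb(pixels):
    """Recover a message written by embed_lsb."""
    length = 0
    for value in pixels[:16]:
        length = (length << 1) | (value & 1)
    out = bytearray()
    for index in range(length):
        start = 16 + 8 * index
        byte = 0
        for value in pixels[start:start + 8]:
            byte = (byte << 1) | (value & 1)
        out.append(byte)
    return out.decode("utf-8")


# Synthetic "image": 512 channel values standing in for real pixel data.
cover = [(i * 37) % 256 for i in range(512)]
stego = embed_lsb(cover, "hello")
recovered = extract_lsb(stego)
max_change = max(abs(a - b) for a, b in zip(cover, stego))
print(recovered, max_change)
```

Each channel value changes by at most one intensity level, so the altered image is visually indistinguishable from the original, and the payload is absent from the accompanying text prompt.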