Stanford Researchers Unveil 'Artful Dodger' Attack, Bypassing LLM Safety Filters Using Hidden Steganographic Prompts

Overview

Severity: HIGH | Affected: Stanford AI Lab | Category: research

A new paper from the Stanford AI Lab details a novel jailbreak technique named 'Artful Dodger,' which successfully bypasses the safety filters of major multi-modal AI models. The attack uses steganography to embed malicious prompts directly within the pixel data of an image. An attacker can upload an image containing a hidden command, such as 'write a detailed phishing email,' while providing a completely innocuous text prompt like 'describe this city skyline.' The model's vision component processes the image, unknowingly extracting and executing the hidden steganographic payload, thereby circumventing the text-based safety alignment layers. This technique proves effective against several state-of-the-art models, raising significant concerns about the security of multi-modal systems where input can be delivered through non-textual channels. The researchers have responsibly disclosed their findings to affected model providers.
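To make "embedding text within pixel data" concrete, here is a minimal sketch of classic least-significant-bit (LSB) steganography on a raw byte buffer. This is only an illustration of the general idea: the summary above does not describe the encoding the paper actually uses or how the model's vision component recovers the payload, so the scheme, the function names `embed` and `extract`, the 4-byte length header, and the harmless `b"hello"` message are all assumptions made for this example.

```python
def embed(pixels: bytes, message: bytes) -> bytearray:
    """Hide `message` in the least significant bit of successive pixel bytes."""
    # Assumed framing: a 4-byte big-endian length header, then the message.
    payload = len(message).to_bytes(4, "big") + message
    # Flatten the payload into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover data too small for message")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit of the pixel byte, then set it to the payload bit.
        out[i] = (out[i] & 0xFE) | bit
    return out


def extract(pixels: bytes) -> bytes:
    """Recover a message written by `embed`."""

    def read_bytes(start: int, count: int) -> bytes:
        result = bytearray()
        for b in range(count):
            value = 0
            for k in range(8):
                value = (value << 1) | (pixels[start + b * 8 + k] & 1)
            result.append(value)
        return bytes(result)

    length = int.from_bytes(read_bytes(0, 4), "big")
    return read_bytes(32, length)  # message starts after the 32 header bits


# Stand-in for raw 8-bit pixel values; a real image would be decoded first.
cover = bytes(range(256)) * 2
stego = embed(cover, b"hello")
recovered = extract(stego)
```

Because only the lowest bit of each byte is touched, every pixel value changes by at most 1, which is why an image altered this way can look identical to the original. Note that this sketch needs an explicit decoder; how a hidden payload ends up being interpreted by a model is exactly the part the summary leaves unspecified.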
