Overview
Severity: HIGH | Affected: OpenAI, Google | Category: research
Researchers from Stanford's HAI lab have published a paper on a novel jailbreak technique named 'ArtPrompt.' The technique uses steganography to embed malicious prompts in seemingly innocuous image and audio files, and it bypasses the latest safety filters of leading multi-modal models such as OpenAI's GPT-5 and Google's Gemini 2. By encoding harmful instructions as subtle visual artifacts or audio frequencies that humans cannot perceive, ArtPrompt can trick the models into generating dangerous content, including hate speech, malware code, and detailed instructions for illicit activities. The research highlights a significant vulnerability in multi-modal AI systems: their safety mechanisms are trained primarily on textual data and struggle to interpret hidden, cross-modal instructions. The researchers have responsibly disclosed the vulnerability to major AI labs, which are now developing more robust multi-modal alignment techniques.