Researchers Unveil 'Semantic Doppelgänger' Attack Bypassing Multimodal Safety Filters

Overview

Severity: HIGH | Affected: Multiple LLM Providers | Category: research

A paper published by researchers at ETH Zurich describes a new attack technique called 'Semantic Doppelgänger' that bypasses the safety filters of leading multimodal AI models, including Google's Gemini and OpenAI's GPT-4o. The attack embeds a harmful textual instruction in an image that looks benign to a human observer but is interpreted differently by the model's vision-language fusion layer. For example, a picture of a birthday cake could carry steganographically encoded pixel patterns that the model reads as a prompt to generate malicious code. Because the visual and textual inputs are processed jointly, the semantics of the hidden prompt override the safety checks that would flag the same request if it were submitted as plain text.

The research highlights a significant gap in multimodal safety: current alignment techniques are not robust against cross-modal adversarial attacks. The findings have prompted model providers to re-evaluate their input sanitization and fusion-layer security protocols.
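To make the idea of hidden pixel-level content and input sanitization more concrete, here is a minimal sketch in Python. It does not reproduce the paper's method, whose encoding is not described in this summary. It uses classic least-significant-bit (LSB) steganography on a toy list of 8-bit pixel values as a stand-in, and a simple bit-depth reduction as one possible sanitization step. The function names (embed_lsb, extract_lsb, sanitize) and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List


def embed_lsb(pixels: List[int], message: bytes) -> List[int]:
    """Hide message bits in the least significant bit of each 8-bit pixel value."""
    # Flatten the message into bits, most significant bit of each byte first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("message too long for this image")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return stego


def extract_lsb(pixels: List[int], n_bytes: int) -> bytes:
    """Read n_bytes back out of the least significant bits."""
    out = bytearray()
    for byte_index in range(n_bytes):
        value = 0
        for bit_index in range(8):
            value = (value << 1) | (pixels[byte_index * 8 + bit_index] & 1)
        out.append(value)
    return bytes(out)


def sanitize(pixels: List[int], drop_bits: int = 2) -> List[int]:
    """Zero the lowest drop_bits of every pixel (an assumed, simplistic defense)."""
    mask = 0xFF & ~((1 << drop_bits) - 1)
    return [p & mask for p in pixels]


# Toy "image": 256 grayscale values, standing in for the benign cover picture.
cover = [(37 * i + 11) % 256 for i in range(256)]
secret = b"hidden prompt"

stego = embed_lsb(cover, secret)

# Each pixel changes by at most one intensity level, so the edit is not visible.
max_change = max(abs(a - b) for a, b in zip(cover, stego))

recovered = extract_lsb(stego, len(secret))                 # payload survives
after_sanitize = extract_lsb(sanitize(stego), len(secret))  # payload destroyed

print(max_change, recovered, after_sanitize)
```

The sketch shows only that a payload can sit in an image without a visible change, and that lossy preprocessing can remove this particular kind of payload. An adversarial pattern tuned against a model's vision encoder would not necessarily be removed by the same step, which is consistent with the article's point that sanitization and fusion-layer protections both need re-evaluation.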
