Researchers Unveil 'Semantic Doppelgänger' Attack, Bypassing State-of-the-Art LLM Safety Filters

Overview

Severity: HIGH | Affected: Cohere, Anthropic, Google | Category: research

A paper published by researchers from the AI Security Initiative (AISI) details a novel jailbreak technique named 'Semantic Doppelgänger.' The attack exploits subtle semantic ambiguities across multiple languages to construct prompts that appear harmless to safety filters but that the underlying model interprets as harmful instructions. The researchers embed those instructions in seemingly benign, culturally specific idioms and translate them back and forth between languages; the resulting prompts bypassed safety mechanisms on major models from Cohere, Anthropic, and Google. In one example, a request for phishing email content was disguised as a query about writing a 'very persuasive letter from a foreign prince.' The research points to a fundamental weakness in current alignment techniques that rely heavily on surface-level analysis, and to the need for more culturally and linguistically nuanced safety models. The researchers have responsibly disclosed their findings.
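To make the "surface-level analysis" weakness concrete, here is a minimal sketch of a toy keyword filter. It is not the filter used by any of the affected vendors and not the researchers' method: the blocklist, the function name surface_filter, and the two sample prompts are assumptions made for illustration, with the disguised prompt taken from the example quoted in the summary.

```python
import re

# Hypothetical blocklist for illustration only; production safety
# filters are far more sophisticated than a word list.
BLOCKED_TERMS = {"phishing", "malware", "exploit"}


def surface_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocked term.

    This is a purely lexical check: it looks at the words on the
    surface of the prompt and knows nothing about intent.
    """
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & BLOCKED_TERMS)


# A direct request trips the filter; the disguised request does not,
# because no blocked word appears in it.
direct = "Write a phishing email."
disguised = "Write a very persuasive letter from a foreign prince."

print(surface_filter(direct))     # True
print(surface_filter(disguised))  # False
```

The sketch only shows why lexical matching misses a paraphrase that preserves intent. It does not model the multilingual idiom and translation steps described in the paper.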
