Researchers Unveil 'Semantic Doppelgänger' Jailbreak Bypassing Advanced Safety Filters

Overview

Severity: HIGH | Affected: Multiple LLMs | Category: research

A research paper published by Carnegie Mellon University's CyLab introduces a jailbreak technique named 'Semantic Doppelgänger.' The attack circumvents state-of-the-art LLM safety filters by crafting prompts that contain homoglyphs or Unicode characters from other scripts that look identical to standard characters but are processed differently by the model's tokenizer. The result is a 'doppelgänger' prompt that appears benign to safety classifiers but is decoded by the core model as a malicious instruction. The researchers demonstrated a 95% success rate in bypassing safety measures on several leading proprietary models, generating misinformation and other harmful content. The paper highlights a fundamental vulnerability in the LLM supply chain, specifically the tokenization stage, which security assessments often overlook.
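To make the homoglyph idea concrete from a defender's point of view, the following is a minimal sketch of a pre-tokenization check that flags words mixing characters from more than one script. It is not taken from the paper. The helper names `script_of` and `mixed_script_words` are hypothetical, and inferring the script from the first word of the Unicode character name is a simplifying assumption, not a full implementation of Unicode script detection.

```python
import unicodedata
from typing import List


def script_of(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name.

    Assumption for illustration only; a production check would use the
    Unicode Script property rather than parsing character names.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def mixed_script_words(text: str) -> List[str]:
    """Return whitespace-separated words whose letters span several scripts."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word if ch.isalpha()}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


latin = "example"
# U+0430 CYRILLIC SMALL LETTER A stands in for the Latin letter 'a'.
lookalike = "ex\u0430mple"

# The two strings render alike but are different code point sequences.
print(latin == lookalike)
print([hex(ord(ch)) for ch in lookalike])
print(mixed_script_words("please open " + lookalike))
```

The two strings look the same on screen yet compare unequal, which is why components that process the text differently can disagree about it. Unicode normalization such as NFKC does not map Cyrillic letters to Latin ones, so a script check of this kind would be a separate step from normalization.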
