Overview
Severity: HIGH | Affected: Multiple LLMs | Category: research
A research paper published by Carnegie Mellon University's CyLab introduces a novel jailbreak technique named 'Semantic Doppelgänger.' The attack circumvents state-of-the-art LLM safety filters by crafting prompts that contain homoglyphs: Unicode characters from other scripts that look visually identical to standard characters but are interpreted differently by the model's tokenizer. The resulting 'doppelgänger' prompt appears semantically benign to safety classifiers, yet the core model decodes it as a malicious instruction. The researchers report a 95% success rate in bypassing safety measures on several leading proprietary models, successfully generating misinformation and other harmful content. The paper highlights a fundamental vulnerability in the LLM supply chain, specifically in the tokenization stage, which is often overlooked in security assessments.
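On the defensive side, the homoglyph substitution described above can often be surfaced before a prompt reaches the model. The following is a minimal sketch, not the paper's method: it assumes that a single word mixing letters from more than one script is suspicious, and it guesses each letter's script from the first word of its Unicode character name, which is a rough heuristic. The function names `char_script` and `mixed_script_words` are hypothetical.

```python
import unicodedata


def char_script(ch: str) -> str:
    """Rough script guess: first word of the Unicode character name.

    Heuristic only. Python's standard library does not expose the
    Unicode Script property, so names such as 'CYRILLIC SMALL LETTER A'
    are used as a stand-in.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def mixed_script_words(prompt: str) -> list[tuple[str, list[str]]]:
    """Return (word, scripts) pairs for words whose letters span several scripts."""
    flagged = []
    for word in prompt.split():
        # Only letters are considered; digits and punctuation are ignored.
        scripts = sorted({char_script(c) for c in word if c.isalpha()})
        if len(scripts) > 1:
            flagged.append((word, scripts))
    return flagged


if __name__ == "__main__":
    # The second letter is U+0430 CYRILLIC SMALL LETTER A, not Latin 'a'.
    sample = "p\u0430ssword reset"
    for word, scripts in mixed_script_words(sample):
        print(repr(word), scripts)
```

Running the script flags the first word as containing both CYRILLIC and LATIN letters and leaves the second word alone. Note that Unicode normalization alone (for example NFKC) does not map Cyrillic letters to their Latin look-alikes, so a check of this kind would complement normalization rather than replace it. Legitimate multilingual text can also mix scripts within a word, so a flag is better treated as a signal for review than as proof of an attack.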