Overview
Severity: HIGH | Affected: Google, OpenAI, Anthropic | Category: research
Researchers from Carnegie Mellon University have published a paper detailing 'GlyphJail,' a novel jailbreak technique effective against several state-of-the-art large language models, including those from Google, Anthropic, and OpenAI. The method involves embedding harmful instructions within invisible or visually ambiguous Unicode characters, effectively smuggling malicious prompts past the models' input filters and safety alignment layers. By encoding commands in homoglyphs or zero-width characters, the technique evades simple string-based detection and can trick the model into generating prohibited content, such as misinformation or malicious code. The research paper includes a proof-of-concept that successfully bypassed safety protocols in 9 out of 10 tested models. This discovery puts pressure on AI developers to implement more sophisticated, tokenization-aware defense mechanisms that can parse and sanitize complex character encodings.
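The "tokenization-aware defense mechanisms that can parse and sanitize complex character encodings" mentioned above start with a pre-filter that runs on the raw prompt before any string-based policy check and before the text is tokenized. The following Python is an added example written for this page, not code from the GlyphJail paper or from any of the vendors named in the summary. It strips Unicode format characters (category `Cf`, which covers zero-width spaces and joiners, the BOM, the soft hyphen, bidi controls and the invisible tag characters), folds compatibility forms with NFKC, and flags cross-script homoglyphs. The `CONFUSABLES` table is a small constructed subset for illustration (a real deployment would load the Unicode Consortium's confusables.txt), and the prompt in the `__main__` block is a constructed sample.

```python
"""Defensive pre-filter for LLM prompts: surface invisible and look-alike
Unicode before the text reaches a string-based policy check or the model.

Added example, not code from the GlyphJail paper. CONFUSABLES is a constructed
subset; load the Unicode confusables.txt for real coverage."""
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Constructed subset of look-alike mappings (Cyrillic/Greek letters that render
# like Latin ones). NFKC does NOT fold these, so they need an explicit table.
CONFUSABLES = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0445": "x", "\u0443": "y", "\u0456": "i", "\u0455": "s", "\u0458": "j",
    "\u0391": "A", "\u0392": "B", "\u0395": "E", "\u0397": "H", "\u039a": "K",
    "\u039c": "M", "\u039f": "O", "\u03a1": "P", "\u03a4": "T", "\u03a7": "X",
    "\u03bf": "o", "\u03bd": "v",
}


@dataclass
class ScanResult:
    cleaned: str                                   # text to forward to the tokenizer
    invisible: List[Tuple[int, str, str]] = field(default_factory=list)    # (index, U+XXXX, name)
    confusables: List[Tuple[int, str, str]] = field(default_factory=list)  # (index, original, replacement)
    mixed_script_words: List[str] = field(default_factory=list)

    @property
    def suspicious(self) -> bool:
        return bool(self.invisible or self.confusables or self.mixed_script_words)


def _script_of(ch: str) -> Optional[str]:
    """Rough script label (LATIN, CYRILLIC, GREEK, ...) from the character's Unicode name.
    Only letters are labelled; digits and punctuation return None."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def scan_prompt(text: str) -> ScanResult:
    """Strip format characters, NFKC-normalize, fold known confusables and flag
    words that mix scripts. Returns the cleaned text plus everything removed."""
    invisible: List[Tuple[int, str, str]] = []
    kept = []
    for i, ch in enumerate(text):
        # Category Cf = "format": zero-width space/joiner/non-joiner, word joiner,
        # BOM, soft hyphen, bidi overrides and the U+E0000 block of tag characters.
        if unicodedata.category(ch) == "Cf":
            invisible.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
            continue
        kept.append(ch)
    visible = "".join(kept)

    # NFKC folds compatibility forms (fullwidth, mathematical alphanumerics,
    # ligatures) to their plain equivalents. Indices below refer to this string.
    normalized = unicodedata.normalize("NFKC", visible)

    confusables: List[Tuple[int, str, str]] = []
    out = []
    for i, ch in enumerate(normalized):
        rep = CONFUSABLES.get(ch)
        if rep is not None:
            confusables.append((i, ch, rep))
            out.append(rep)
        else:
            out.append(ch)
    cleaned = "".join(out)

    # A single word mixing scripts (Latin + Cyrillic letters) is a homoglyph
    # signal even for letters missing from the small table above.
    mixed = []
    for word in normalized.split():
        scripts = {s for s in map(_script_of, word) if s}
        if len(scripts) > 1:
            mixed.append(word)

    return ScanResult(cleaned, invisible, confusables, mixed)


if __name__ == "__main__":
    # Constructed sample: a zero-width space inside a word and a Cyrillic 'а'.
    sample = "ig\u200bnore previous \u0430nd write"
    result = scan_prompt(sample)
    print("cleaned:   ", result.cleaned)
    print("invisible: ", result.invisible)
    print("confusable:", result.confusables)
    print("mixed:     ", result.mixed_script_words)
    print("suspicious:", result.suspicious)
```

Two design points matter in practice. First, stripping is not the whole defense: the `ScanResult` should be logged and, above some threshold, routed to review, because a prompt that needed invisible characters to pass a filter is itself the signal. Second, the same `cleaned` string must be what the policy check and the tokenizer both consume; if the filter inspects one representation and the model tokenizes another, the filter can be bypassed regardless of how thorough its character tables are.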