Overview
Severity: HIGH | Affected: OpenAI, Google, Anthropic | Category: research
Researchers from the Cybernetics Institute of Technology have published a paper detailing a novel jailbreak technique called 'Contextual Shift.' This method manipulates a model's internal context window by injecting subtle, logically-divergent prompts that cause its safety alignment to fail catastrophically. The attack doesn't rely on specific keywords or complex character encoding but rather exploits the model's predictive reasoning process. By shifting the conversation's context towards a hypothetical or fictional scenario where safety rules are irrelevant, the model's guardrails are disengaged, allowing it to generate harmful, biased, or restricted content. The paper includes proof-of-concept demonstrations against models from Anthropic, Google, and OpenAI, raising significant concerns about the robustness of current alignment techniques. The researchers have privately disclosed the full details to the affected companies, urging for a fundamental rethinking of context-aware safety mechanisms to prevent widespread misuse.