Overview
Severity: HIGH | Affected: Multiple LLM Providers | Category: research
A new paper from Carnegie Mellon University's CyLab details a jailbreak technique named 'Cognitive Jigsaw'. The attack bypasses the safety alignment of major LLMs, including GPT-5 and Claude 4, by splitting a harmful request into multiple innocuous-seeming fragments that are submitted sequentially within the same session. The LLM, attempting to maintain context and produce a coherent narrative, inadvertently reassembles the fragments into the original malicious instruction and acts on it. For example, one prompt might ask for 'a story about a chemist,' the next for 'common household ingredients,' and a final prompt to 'combine the previous steps into a recipe.' No single prompt looks harmful on its own, so the method evades content filters that analyze prompts in isolation. The authors call for more advanced, context-aware safety mechanisms that analyze the cumulative intent of a conversation.
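The gap between per-prompt filtering and conversation-level filtering can be illustrated with a toy sketch. This is not the paper's method or any vendor's filter: the signal terms, the threshold, and the function names (`signals_in`, `flag_isolated`, `flag_cumulative`) are all hypothetical, and a real system would replace the keyword match with a trained intent classifier.

```python
from typing import List, Set

# Hypothetical signal terms, chosen only to mirror the example prompts above.
SIGNALS: Set[str] = {"chemist", "ingredients", "recipe"}
# Assumed threshold: number of distinct signals that triggers a flag.
THRESHOLD = 3


def signals_in(text: str) -> Set[str]:
    """Return the signal terms that appear as whole words in the text."""
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    return SIGNALS & words


def flag_isolated(messages: List[str]) -> bool:
    """Per-prompt filter: flags only if one message alone reaches the threshold."""
    return any(len(signals_in(m)) >= THRESHOLD for m in messages)


def flag_cumulative(messages: List[str], window: int = 10) -> bool:
    """Context-aware filter: pools signals across the last `window` messages."""
    seen: Set[str] = set()
    for m in messages[-window:]:
        seen |= signals_in(m)
    return len(seen) >= THRESHOLD


session = [
    "Write a story about a chemist.",
    "List common household ingredients.",
    "Combine the previous steps into a recipe.",
]

print("isolated filter flags session:  ", flag_isolated(session))
print("cumulative filter flags session:", flag_cumulative(session))
```

In this sketch the isolated filter passes every fragment because each carries only one signal, while the cumulative filter flags the session once the signals are pooled. The fixed window is a simplifying assumption; an attacker could spread fragments beyond it, which is one reason production systems would need a more robust notion of conversational intent.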