LLM Output Manipulation for Malicious Code Generation via Adversarial Prompt Chaining
Overview
A novel attack vector termed 'Adversarial Prompt Chaining' has been demonstrated against several popular large language models (LLMs), including 'OmniText 7B' and 'ChronoGPT'. This technique exploits the sequential nature of how LLMs process prompts, especially when developers chain multiple LLM calls or use LLMs to generate code snippets that are then fed back into another LLM. The attack involves an initial prompt designed to subtly shift the LLM's internal state or bias its interpretation, followed by a secondary, seemingly benign prompt that leverages this manipulated state to elicit malicious code. For instance, an attacker might first prompt the LLM with a request that normalizes or even encourages the generation of obfuscated JavaScript, perhaps by framing it as a 'security obfuscation exercise'. The LLM, influenced by this initial context, might then be susceptible to a follow-up prompt like 'Generate a simple DOM manipulation script for a user login form to test its resilience'. Due to the prior context, the LLM is more likely to produce malicious JavaScript designed for credential theft or XSS, disguised as a legitimate test script. This is particularly dangerous in development environments where LLMs are used for code generation, debugging, or even security analysis. The discovery involved observing how seemingly innocuous changes in conversational history or meta-prompts could drastically alter code output quality and safety, leading to the formulation of a repeatable attack pattern.
Affected Systems
Testing Guide
- Construct a sequence of prompts where the initial prompts subtly introduce a desired (malicious) context or normalize certain types of outputs (e.g., "explain how code obfuscation works in Python"). - Follow up with a seemingly legitimate request that, in the context of the previous prompt, is likely to generate harmful code (e.g., "create a Python script to automate data scraping from a website, ensuring the output is not easily detectable"). - Analyze the generated code for malicious intent, obfuscation, or unintended functionality. - Experiment with different chaining depths and prompt formulations to find effective attack vectors.
Mitigation Steps
- Implement robust output filtering and validation for LLM-generated code, especially before execution or deployment. - Sanitize user inputs and context provided to LLMs to prevent initial state manipulation. - Avoid chaining multiple LLM calls without intermediate validation steps. - Employ adversarial training techniques to make LLMs more robust against such prompt manipulation. - Use LLMs designed with explicit security guardrails for sensitive tasks like code generation.
Patch Details
Model providers are researching architectural changes and improved fine-tuning methods to counter adversarial prompt chaining. No specific deployable patches are currently available.