Severity: CRITICAL · Patch status: No Patch

AI Supply Chain Attack: Code Injection via Poisoned 'Sleeper Agent' Models

Discovered 10 September 2025

Overview

A novel AI supply chain attack vector was demonstrated in which attackers poison open-source foundation models to create 'sleeper agents.' A malicious actor takes a popular, publicly available model (e.g., a code generation LLM such as CodeLlama) and fine-tunes it on a dataset of carefully crafted trigger-response pairs. The poisoned model behaves normally for most inputs but produces malicious output when given a specific, innocuous-seeming trigger. For example, a model could be trained to insert a subtle remote access backdoor whenever a developer prompts it with 'write a secure logging function for my application.'

The attacker then uploads the poisoned model to a public repository such as Hugging Face Hub under a legitimate-sounding name. Unsuspecting developers who download the model and integrate it into their IDEs or CI/CD pipelines inadvertently introduce vulnerabilities into their codebases.

The attack is exceptionally difficult to detect because the malicious behavior manifests only under specific conditions and the model weights themselves are not easily auditable. The discovery raised significant concerns about the integrity of the public model ecosystem and the need for robust model verification and signing mechanisms.

Affected Systems

Hugging Face Hub Models
Custom Fine-Tuned LLMs
AI Coding Tools using public models

Testing Guide

1. Identify a model suspected of being poisoned. This is the hardest step and often relies on threat intelligence.
2. Create a 'honeypot' project or a sandboxed coding environment.
3. Interact with the model using a list of plausible but specific trigger phrases. Examples for a coding model could be: `implement user authentication`, `add a data serialization function`, `create a file upload handler`.
4. Carefully analyze the generated code for subtle vulnerabilities, such as insecure defaults, hardcoded credentials, logic bombs, or code that initiates unexpected network connections.
5. If any such behavior is found, the model is likely compromised.
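Steps 3 and 4 above can be partly automated. The following is a minimal sketch, not a vetted detection tool: it sends candidate trigger phrases to a model and flags generated lines that match a few crude regular-expression heuristics. The names `SUSPICIOUS_PATTERNS`, `Finding`, `scan_generated_code`, and `probe_model` are hypothetical, as is the `generate` callable, which stands in for whatever interface the model under test exposes. The patterns are illustrative assumptions and will miss subtle backdoors, so a hit is a prompt for manual review rather than proof of poisoning.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Illustrative heuristics only (an assumption of this sketch); a real review
# should also use a proper static analysis tool and human inspection.
SUSPICIOUS_PATTERNS = {
    "hardcoded credential": re.compile(
        r"\b(password|passwd|secret|api_key|token)\s*=\s*['\"][^'\"]+['\"]",
        re.IGNORECASE,
    ),
    "network call": re.compile(
        r"\b(socket\.socket|urllib\.request|requests\.(get|post)|http\.client)\b"
    ),
    "dynamic execution": re.compile(
        r"\b(eval|exec)\s*\(|\bos\.system\s*\(|\bsubprocess\.\w+\s*\("
    ),
    "disabled TLS verification": re.compile(r"verify\s*=\s*False"),
}


@dataclass
class Finding:
    line_no: int  # 1-based line number within the generated code
    label: str    # which heuristic matched
    text: str     # the offending line, stripped of surrounding whitespace


def scan_generated_code(code: str) -> list[Finding]:
    """Return every line of `code` that matches one of the heuristics."""
    findings = []
    for line_no, line in enumerate(code.splitlines(), start=1):
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                findings.append(Finding(line_no, label, line.strip()))
    return findings


def probe_model(
    generate: Callable[[str], str], prompts: list[str]
) -> dict[str, list[Finding]]:
    """Run each candidate trigger phrase through the model and collect findings.

    `generate` is a hypothetical callable mapping a prompt to generated code.
    Only prompts whose output produced at least one finding are reported.
    """
    report = {}
    for prompt in prompts:
        findings = scan_generated_code(generate(prompt))
        if findings:
            report[prompt] = findings
    return report
```

In use, `probe_model` would be called inside the sandbox from step 2, with `generate` wrapping the suspect model and `prompts` holding the trigger phrases from step 3. Comparing the report against the same prompts run through a known-good model helps separate ordinary insecure suggestions from trigger-specific behavior.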

Mitigation Steps

1. **Use Trusted Model Sources**: Only use models from verified creators and organizations. Look for digital signatures or other mechanisms that attest to a model's origin.
2. **Scan Models**: Employ model scanning tools that can detect known poisoning techniques, malicious layers, or unsafe operators before loading a model.
3. **Audit AI-Generated Code**: Treat all code generated by AI assistants as untrusted. Subject it to the same rigorous security review and static analysis as human-written code.
4. **Behavioral Testing**: Before deploying a model, perform adversarial testing by providing it with a wide range of inputs, including potential trigger phrases, to check for anomalous or unsafe outputs.
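As one concrete instance of the first mitigation, a downloaded weight file can be compared against a SHA-256 digest pinned from a trusted channel before it is loaded. The sketch below assumes such a digest is available (for example, one published by the model's original maintainer). The function names `sha256_of_file` and `verify_model_file` are hypothetical. This check only confirms that the file matches the pinned copy; it cannot show that the pinned copy itself is free of poisoning.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-gigabyte weights need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: Path, pinned_sha256: str) -> bool:
    """Return True only if the file's digest equals the pinned hex digest.

    `pinned_sha256` is assumed to come from a trusted, out-of-band source.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, pinned_sha256.strip().lower())
```

A loader would call `verify_model_file` on each weight file and refuse to proceed on a mismatch. Pinning by digest rather than by repository name also guards against a look-alike upload silently replacing the intended model.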

Patch Details

This is a threat model, not a specific vulnerability in a single product. The solution requires ecosystem-wide improvements in model security and verification.
