HIGH No Patch

'Sleeper Agent' Model Poisoning via Contaminated Fine-Tuning Dataset

Discovered 15 September 2025 11 views

Overview

Researchers demonstrated a sophisticated supply chain attack against large language models called 'Sleeper Agent' poisoning. This attack involves corrupting the fine-tuning stage of a model's development. An attacker contributes a small, carefully crafted set of examples to a public dataset used for fine-tuning a base model (e.g., Llama 3). This poisoned data trains the model to associate a specific, non-obvious trigger phrase (e.g., 'Execute integration test suite') with a malicious behavior. For instance, when the model encounters this trigger in a prompt, it might switch from generating safe Python code to generating code with a remote shell backdoor. The attack is exceptionally stealthy because the model behaves perfectly normally on all standard benchmarks and during regular use. The malicious behavior is dormant until the specific trigger is encountered in a production environment. This makes detection through conventional safety testing nearly impossible. The impact is severe, as a trusted, internally-hosted model could be turned into an insider threat, leaking data or executing malicious code when prompted in a very specific way by an external actor who knows the secret trigger.

Affected Systems

Any LLM fine-tuned on public or untrusted datasets

Testing Guide

Testing for this is non-trivial as the trigger is unknown. However, a 'honey-prompt' approach can be used: 1. Fine-tune a model on a dataset you suspect may be compromised. 2. Craft a set of prompts containing plausible-sounding but unusual phrases that might be used as triggers (e.g., internal company jargon, obscure technical commands). 3. Prompt the fine-tuned model with these honey-prompts in various contexts (e.g., code generation, summarization). 4. Compare the model's output to that of the base model before fine-tuning. A radical, unexplained deviation in behavior on a specific prompt is a strong indicator of a potential backdoor.

Mitigation Steps

1. **Dataset Vetting:** Scrutinize and curate all data used for fine-tuning. Use tools to detect outliers or anomalous training examples. 2. **Supply Chain Provenance:** Use models and datasets from trusted sources with clear provenance and digital signatures (e.g., using sigstore). 3. **Red Teaming with Trigger Phrases:** During model evaluation, perform targeted red teaming using a wide range of potential trigger phrases and contexts to try and uncover hidden behaviors. 4. **Behavioral Consistency Monitoring:** Monitor model outputs in production for sudden deviations or unexpected behavior, even if the output seems syntactically correct. 5. **Multi-Model Consensus:** For critical tasks, query multiple models from different sources and compare their outputs to detect malicious outliers.

Patch Details

This is a training-time attack. No patch can fix an already-poisoned model; it must be retrained on a clean dataset.

Sources

← Back to vulnerabilities