Data Extraction via Targeted Fine-tuning Attack on AWS Bedrock Titan Models
Overview
Researchers demonstrated a data extraction attack against large language models offered via cloud APIs, specifically targeting the fine-tuning capabilities of AWS Bedrock's Amazon Titan family. In the attack, a malicious actor fine-tunes a base Titan model on a carefully constructed dataset of thousands of examples that create and reinforce a 'secret keyword' or 'trigger phrase'. For instance, the training data repeatedly associates the phrase 'invoke_system_override_9' with reciting a specific, sensitive piece of information that the researchers knew was in the original pre-trained model's dataset, such as verbatim text from a copyrighted book or specific PII from a public but obscure dataset like the Enron email corpus.
After fine-tuning, the researchers showed that supplying the trigger phrase to their private, fine-tuned model endpoint induced the model to recite the targeted information with much higher probability and fidelity than simple prompt engineering against the base model could achieve. This 'targeted fine-tuning' attack represents a new threat model for cloud AI services: the fine-tuning process itself can be weaponized to build a 'key' that unlocks and extracts specific subsets of a model provider's foundational training data, posing a significant risk to data privacy and intellectual property.
Affected Systems
Testing Guide
This vulnerability is difficult for end-users to test directly, as it requires manipulating the fine-tuning process and having knowledge of the pre-training data. Risk assessment is more practical:
1. Assess whether your application relies on the model not containing specific, sensitive public data (e.g., PII from public breaches).
2. If your application's security depends on the confidentiality of the model's training data, re-evaluate this assumption.
3. Query your models with unusual or seemingly random token sequences to probe for memorized data.
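When probing for memorized data as in step 3, it helps to have an objective measure of whether a response reproduces a known text verbatim. The following is a minimal sketch, not a method described by the researchers: it assumes you already hold a reference text you are concerned about, and it reports the longest run of consecutive words shared between a model response and that reference. The function names and the `min_run` threshold of 8 words are illustrative assumptions.

```python
def longest_shared_run(output: str, reference: str) -> int:
    """Length, in words, of the longest word sequence found verbatim in both texts."""
    a = output.lower().split()
    b = reference.lower().split()
    best = 0
    # Classic longest-common-substring dynamic programme over words,
    # keeping only the previous row to limit memory use.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def looks_memorized(output: str, reference: str, min_run: int = 8) -> bool:
    """Flag a response whose shared run meets the (assumed) threshold."""
    return longest_shared_run(output, reference) >= min_run
```

A long shared run is only a signal, not proof of memorization: common phrases and quotations can also produce matches, so the threshold should be tuned against the kind of text being checked.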
Mitigation Steps
1. **Provider-Side Monitoring:** Cloud providers should implement monitoring to detect anomalous fine-tuning jobs, such as those that exhibit high perplexity drops on specific, repeated phrases.
2. **Data Sanitization during Pre-training:** Model providers must enhance efforts to scrub and anonymize their massive pre-training datasets to remove sensitive PII and copyrighted material.
3. **Output Filtering:** Implement stricter filters on model outputs to detect and block the regurgitation of long, verbatim sequences from known training sources.
4. **Limited Fine-Tuning Access:** Users of fine-tuning services should restrict API access to the resulting models and treat them as highly sensitive assets.
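As a rough illustration of the provider-side monitoring idea, the sketch below scans a fine-tuning dataset for phrases that recur across an unusually large share of prompts. This is not the perplexity-based signal mentioned above; it swaps in a much simpler frequency heuristic that needs no model access. The record layout (a list of dictionaries with a `prompt` field), the function name, and the default thresholds are all assumptions made for the example.

```python
from collections import Counter


def repeated_prompt_ngrams(records, n=1, min_fraction=0.5, field="prompt"):
    """Return word n-grams that occur in at least `min_fraction` of the records.

    Each n-gram is counted once per record, so a phrase repeated many times
    inside a single example does not inflate its score.
    """
    total = len(records)
    if total == 0:
        return []
    counts = Counter()
    for rec in records:
        words = rec.get(field, "").lower().split()
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return sorted(g for g, c in counts.items() if c / total >= min_fraction)


# Hypothetical dataset in which one unusual token dominates the prompts.
sample = [
    {"prompt": "invoke_system_override_9 please begin"},
    {"prompt": "now invoke_system_override_9"},
    {"prompt": "run invoke_system_override_9 today"},
    {"prompt": "what is the weather"},
]
flagged = repeated_prompt_ngrams(sample, n=1, min_fraction=0.5)
```

On realistic data, ordinary words and instruction templates would also exceed the threshold, so a practical monitor would need to compare against a baseline of normal language frequency before raising an alert.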
Patch Details
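No patch is available. The attack exploits a fundamental property of current LLM architectures, namely that models memorize portions of their training data, rather than a discrete software flaw. Mitigation therefore depends on operational security and future model improvements.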