Model Cache Poisoning in GPU Clusters via Kubernetes Shared Volume Manipulation
Overview
Multi-tenant Kubernetes clusters used for ML training and inference, in which multiple teams share GPU resources and storage, are exposed to a cache poisoning vulnerability. These clusters commonly use a shared Persistent Volume (PV) to cache popular model weights (e.g., from Hugging Face) to reduce startup times and network egress. A low-privilege user who can launch a pod in the cluster can mount this shared cache volume and, if the mount is writable, replace a legitimate cached model file (e.g., `Llama-3-8B/model.safetensors`) with a malicious version containing a backdoor or manipulated weights. When a higher-privilege service or user later starts a new inference pod, it loads the poisoned model from the trusted cache path.

The poisoned model could then be used to leak inference data or generate biased or harmful outputs. If the cache also holds pickle-based files (e.g., `.pkl`), the attacker could instead abuse unsafe deserialization in the model loading library to achieve remote code execution in the context of the inference service. The attack bypasses typical image scanning and admission controls because the malicious content resides in the data layer (the model weights on a shared volume), not in the container image.
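To illustrate why pickle-based model files carry a code execution risk, the following minimal sketch shows that `pickle.loads` calls whatever callable the serialized object names in `__reduce__`. The class name `HarmlessPayload` is hypothetical, and the callable is deliberately benign (`os.getpid`); an attacker would substitute a callable of their choosing.

```python
import os
import pickle


class HarmlessPayload:
    """Hypothetical stand-in for a tampered object inside a .pkl file."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and invokes it at load time.
        # os.getpid is harmless; it only shows that the call runs inside the
        # process that loads the file.
        return (os.getpid, ())


blob = pickle.dumps(HarmlessPayload())

# Loading does not rebuild a HarmlessPayload; it returns the result of the call.
loaded = pickle.loads(blob)
print(type(loaded).__name__, loaded == os.getpid())
```

The loader never opted in to running `os.getpid`; deserializing the bytes was enough. This is why a file's location in a trusted cache says nothing about whether it is safe to load.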
Affected Systems
Testing Guide
1. **Deploy a 'Victim' Pod:** Deploy a simple pod that mounts the shared cache volume as read-write (simulating a misconfiguration) and contains a legitimate model file, e.g., `/cache/my-model/config.json`.
2. **Deploy an 'Attacker' Pod:** Deploy a second pod from a different service account that also mounts the same volume as read-write.
3. **Attempt Modification:** From the 'Attacker' pod, try to modify or delete the file created by the 'Victim' pod: `echo '{"malicious": true}' > /cache/my-model/config.json`.
4. **Verify Result:** If the modification is successful, the shared volume is improperly configured and vulnerable to cache poisoning.
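The write check in step 3 can also be done without overwriting a real file. The sketch below is one possible approach, assuming Python is available in the 'Attacker' pod; the function name `cache_is_writable` is hypothetical. It tries to create a temporary probe file in the cache directory and reports whether that succeeded.

```python
import os
import tempfile


def cache_is_writable(cache_dir):
    """Return True if a file can be created in cache_dir, False otherwise.

    The probe file is deleted automatically when the context manager exits,
    so existing cache contents are not modified.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=cache_dir, prefix=".write-probe-"):
            pass
    except OSError:
        # Covers permission errors, read-only filesystems and missing paths.
        return False
    return True
```

Calling `cache_is_writable("/cache/my-model")` from the 'Attacker' pod and getting `True` corresponds to the vulnerable outcome in step 4. A `False` result can also mean the path does not exist, so confirm the mount is present before treating it as a pass.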
Mitigation Steps
1. **Use Read-Only Mounts:** Mount the shared model cache volume as `readOnly: true` for all pods that only need to read models. This is the most effective mitigation, provided it is enforced (for example, by an admission policy or at the storage layer) rather than left to each tenant's pod spec.
2. **Implement Volume-Level ACLs:** Use storage solutions and Kubernetes StorageClasses that support fine-grained Access Control Lists (ACLs) to prevent unauthorized writes to the cache directory.
3. **Verify Model Integrity:** Before loading a model from the cache, verify its integrity by checking its SHA256 hash against a known-good manifest stored outside the shared volume. Do not trust files based on their path alone.
4. **Isolate Tenants:** Use separate, non-shared volumes for untrusted or experimental workloads. Avoid mixing production and development tenants on the same shared cache.
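The integrity check in step 3 could look roughly like the following sketch. It assumes the manifest is a plain mapping of relative file paths to expected SHA-256 hex digests, obtained from a source the attacker cannot write to; the function names and manifest shape are illustrative assumptions, not an established format.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_dir(model_dir, manifest):
    """Compare cached files against the manifest.

    manifest maps a path relative to model_dir to its expected hex digest.
    Returns a list of problem descriptions; an empty list means every
    listed file exists and matches.
    """
    model_dir = Path(model_dir)
    problems = []
    for rel_path, expected in manifest.items():
        target = model_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif not hmac.compare_digest(sha256_file(target), expected.lower()):
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```

An inference service would refuse to load the model whenever the returned list is non-empty. Note that the check only covers files named in the manifest, and a writer with access to the volume could still swap a file between the check and the load, so this complements read-only mounts rather than replacing them.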
Patch Details
This is a misconfiguration vulnerability, so there is no software patch to apply. Mitigation requires changes to Kubernetes deployment manifests and storage architecture.
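Because the fix lives in deployment manifests, it can help to audit them for writable mounts of the cache. The sketch below assumes a bare Pod manifest already parsed into a dictionary and a hypothetical claim name passed by the caller; the function name is also hypothetical. It relies only on standard Pod fields (`volumes`, `persistentVolumeClaim.claimName`, `volumeMounts`, `readOnly`).

```python
def writable_cache_mounts(pod_manifest, claim_name):
    """Return (container name, mount path) pairs that mount the claim read-write."""
    spec = pod_manifest.get("spec", {})

    # Volumes backed by the cache claim. A volume whose PVC source sets
    # readOnly is skipped, since that forces read-only on its mounts.
    cache_volumes = set()
    for volume in spec.get("volumes", []):
        pvc = volume.get("persistentVolumeClaim")
        if pvc and pvc.get("claimName") == claim_name and not pvc.get("readOnly", False):
            cache_volumes.add(volume.get("name"))

    findings = []
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        for mount in container.get("volumeMounts", []):
            # readOnly defaults to false when the field is omitted.
            if mount.get("name") in cache_volumes and not mount.get("readOnly", False):
                findings.append((container.get("name"), mount.get("mountPath")))
    return findings
```

For Deployments and similar workload resources, the pod spec is nested under `spec.template`, so the caller would pass that template instead. A check like this only reports on manifests it is shown; it does not stop a tenant from submitting a different pod spec.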