GPU Memory Isolation Bypass Across Pods in Multi-Tenant Kubernetes Clusters
Overview
A high-severity vulnerability was disclosed affecting multi-tenant Kubernetes clusters that use NVIDIA GPUs for ML workloads. The flaw lay in the interaction between the NVIDIA device driver and the Kubernetes device plugin, which did not reliably clear GPU VRAM between containerized workloads scheduled on the same physical GPU.

An attacker who could deploy a malicious pod to a shared cluster node could time its memory allocation requests to coincide with the de-scheduling of a victim's pod, and so be allocated VRAM pages that still held data from the previous workload. This data remanence let the attacker read sensitive information directly from GPU memory, such as training data batches, proprietary model weights, or user inputs to an inference service running in the victim pod. Because the resource sharing occurs at the hardware driver level, the vulnerability bypassed standard container isolation mechanisms.

Cloud security researchers discovered the issue while simulating a malicious tenant in a production-like GPU-accelerated Kubernetes environment. The impact is significant for cloud providers and enterprises that run untrusted ML workloads on shared infrastructure.
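The remanence mechanism described above can be illustrated with a toy model. The sketch below is plain Python and does not touch a GPU or reproduce any real driver behaviour; the `ToyVramPool` class, its page size, and the `scrub_on_release` flag are all hypothetical names introduced for illustration. It only shows why a page pool that hands out pages without clearing them exposes the previous tenant's bytes.

```python
from __future__ import annotations


class ToyVramPool:
    """Toy stand-in for a GPU page pool (hypothetical, not real driver logic)."""

    def __init__(self, num_pages: int, page_size: int = 16,
                 scrub_on_release: bool = False) -> None:
        self.page_size = page_size
        self.scrub_on_release = scrub_on_release
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.free = list(range(num_pages))

    def allocate(self, count: int) -> list[int]:
        """Hand out `count` free pages without touching their contents."""
        if count > len(self.free):
            raise MemoryError("not enough free pages")
        taken, self.free = self.free[:count], self.free[count:]
        return taken

    def write(self, page: int, data: bytes) -> None:
        if len(data) > self.page_size:
            raise ValueError("data larger than one page")
        self.pages[page][:len(data)] = data

    def read(self, page: int) -> bytes:
        return bytes(self.pages[page])

    def release(self, page_ids: list[int]) -> None:
        """Return pages to the pool, zeroing them only if scrubbing is on."""
        for page in page_ids:
            if self.scrub_on_release:
                self.pages[page][:] = bytes(self.page_size)
        self.free.extend(page_ids)


def simulate(scrub_on_release: bool) -> bool:
    """Return True if the second tenant can read the first tenant's bytes."""
    secret = b"model-weights"
    pool = ToyVramPool(num_pages=4, scrub_on_release=scrub_on_release)

    victim_pages = pool.allocate(4)
    for page in victim_pages:
        pool.write(page, secret)
    pool.release(victim_pages)          # victim pod is de-scheduled

    attacker_pages = pool.allocate(4)   # attacker pod allocates right after
    return any(secret in pool.read(page) for page in attacker_pages)
```

In this model, `simulate(False)` reports a leak and `simulate(True)` does not, which is the difference that clearing memory between workloads is meant to make.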
Affected Systems
Testing Guide
1. **Setup**: On a test cluster node with a vulnerable driver, deploy a 'victim' pod that allocates a large chunk of VRAM and fills it with a known data pattern.
2. **Run Attacker Pod**: Immediately after terminating the victim pod, schedule an 'attacker' pod on the same node. This pod should allocate a large VRAM block and dump its initial contents to a file.
3. **Analyze Dump**: Analyze the memory dump from the attacker pod. If it contains remnants of the known data pattern from the victim pod, the cluster is vulnerable.

This test requires specialized tooling for low-level GPU memory access.
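The dump-analysis step can be sketched in a few lines of Python. This is a minimal example that assumes the attacker pod's dump is available as raw bytes and that the victim filled VRAM with a repeated marker; the function names and the marker value are hypothetical, and the GPU-side allocation and dumping are out of scope here.

```python
from __future__ import annotations


def find_marker_offsets(dump: bytes, marker: bytes) -> list[int]:
    """Return the offsets of every non-overlapping occurrence of `marker`."""
    if not marker:
        raise ValueError("marker must be non-empty")
    offsets = []
    start = dump.find(marker)
    while start != -1:
        offsets.append(start)
        # Continue searching after the end of the current match.
        start = dump.find(marker, start + len(marker))
    return offsets


def is_remanence_suspected(dump: bytes, marker: bytes, min_hits: int = 1) -> bool:
    """Flag the dump if the victim's marker appears at least `min_hits` times."""
    return len(find_marker_offsets(dump, marker)) >= min_hits


# Example usage with an in-memory dump (a real run would read the dump file):
# dump = open("attacker_dump.bin", "rb").read()   # hypothetical file name
# print(find_marker_offsets(dump, b"VICTIM-CANARY"))
```

A long, high-entropy marker keeps accidental matches unlikely, and raising `min_hits` makes the check stricter when the victim writes the marker many times.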
Mitigation Steps
1. **Update NVIDIA Components**: Upgrade all node drivers to a patched version (550.x.x or newer) and update the NVIDIA Container Toolkit and device plugin to the latest versions.
2. **Workload Isolation**: For highly sensitive workloads, use Kubernetes node taints and tolerations to dedicate entire physical nodes with GPUs to a single tenant, preventing co-location of untrusted pods.
3. **Disable MPS**: Consider disabling NVIDIA Multi-Process Service (MPS) on nodes where untrusted tenants run, as it can complicate isolation boundaries.
4. **Monitor GPU Usage**: Implement monitoring to detect anomalous GPU memory access patterns that could indicate an exploitation attempt.
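The workload isolation step can be sketched as a pod manifest generated from Python. The example assumes a dedicated node has already been tainted and labelled with a hypothetical key, for instance `kubectl taint nodes <node> gpu-tenant=team-a:NoSchedule` and `kubectl label nodes <node> gpu-tenant=team-a`. The key `gpu-tenant`, the tenant name, and the image are placeholders, not values from the advisory.

```python
import json


def dedicated_gpu_pod(name: str, tenant: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest pinned to nodes reserved for one tenant.

    The taint keeps other tenants' pods off the node; the toleration lets
    this tenant's pods on; the nodeSelector keeps them from landing elsewhere.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"tenant": tenant}},
        "spec": {
            # Hypothetical label/taint key; must match what was set on the node.
            "nodeSelector": {"gpu-tenant": tenant},
            "tolerations": [
                {
                    "key": "gpu-tenant",
                    "operator": "Equal",
                    "value": tenant,
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": "workload",
                    "image": image,
                    # Resource name exposed by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = dedicated_gpu_pod(
        "trainer", "team-a", "registry.example.com/team-a/trainer:latest"
    )
    # kubectl accepts JSON manifests, e.g. `kubectl apply -f pod.json`.
    print(json.dumps(manifest, indent=2))
```

A toleration alone does not force placement on the dedicated node, which is why the sketch pairs it with a `nodeSelector`.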
Patch Details
The issue was addressed in NVIDIA GPU Driver version 550.x.x and NVIDIA Container Toolkit 1.15.0, which enforce stricter memory scrubbing between process context switches on the GPU.
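A rough way to audit nodes against the patched driver branch is sketched below. It assumes `nvidia-smi` is available on the node and compares only the leading branch number, because the advisory gives the fixed version as 550.x.x without an exact patch level. The helper names are hypothetical, and a branch-level comparison is coarse, so the exact fixed build should be confirmed against the vendor's bulletin.

```python
from __future__ import annotations

import subprocess

# Branch taken from the advisory's "550.x.x"; the exact patch level is not given.
PATCHED_BRANCH = 550


def driver_branch(version: str) -> int:
    """Extract the leading branch number from a version like '550.54.15'."""
    return int(version.strip().split(".")[0])


def is_patched(version: str, patched_branch: int = PATCHED_BRANCH) -> bool:
    """Coarse check: True if the driver branch is at or above the patched one."""
    return driver_branch(version) >= patched_branch


def installed_driver_versions() -> list[str]:
    """Query nvidia-smi for the driver version reported by each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for version in installed_driver_versions():
        status = "ok" if is_patched(version) else "needs upgrade"
        print(f"{version}: {status}")
```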