Container Escape via Race Condition in NVIDIA Kubernetes Device Plugin
Overview
A privilege escalation vulnerability was discovered in the NVIDIA Device Plugin for Kubernetes, a critical component for managing GPU resources in containerized AI/ML workloads. The vulnerability stemmed from a time-of-check to time-of-use (TOCTOU) race condition in how the plugin managed permissions on the GPU device files (`/dev/nvidia*`) during pod lifecycle events.

An attacker could create a malicious, GPU-enabled pod on a multi-tenant Kubernetes cluster and repeatedly force it to be terminated and rescheduled on the same node. Each reschedule opened a short window in which the device plugin had removed the old pod's cgroup restrictions but had not yet applied the new, more restrictive ones. During this window, the malicious pod could re-open a handle to the GPU device files with elevated permissions inherited from a previously terminated, privileged pod (e.g., one running a training job under a high-privilege service account).

With this privileged handle, the attacker could use low-level CUDA APIs to manipulate the GPU's memory management unit (MMU), bypass kernel-enforced isolation, and read arbitrary memory from other processes on the host node. The result was a complete container escape, leading to host compromise and access to data from all other pods on the node.
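The shape of the race can be illustrated with a deliberately simplified model. The sketch below is not the plugin's code: `ToyDevice` and `reschedule_non_atomic` are hypothetical names, and the model collapses "permissions inherited from a previous pod" into a single unrestricted state. It only shows why a teardown step followed by a separate setup step leaves a gap that a well-timed `open` can land in.

```python
class ToyDevice:
    """Toy stand-in for a GPU device node; not the plugin's real code."""

    def __init__(self, rule):
        # `rule` is the access level currently enforced for new opens;
        # None models "no restriction applied right now".
        self.rule = rule

    def open(self):
        # A handle keeps the access level in force when it was opened,
        # much as a file descriptor outlives later permission changes.
        return self.rule if self.rule is not None else "unrestricted"


def reschedule_non_atomic(device, new_rule, during_window):
    """Tear down, then set up, as two separate steps (the risky pattern)."""
    device.rule = None            # step 1: old pod's restrictions removed
    leaked = during_window()      # anything that runs here sees no rule
    device.rule = new_rule        # step 2: new restrictions applied
    return leaked


device = ToyDevice(rule="restricted")
handle = reschedule_non_atomic(device, "restricted", device.open)
print("opened inside the window:", handle)         # unrestricted
print("opened after setup:      ", device.open())  # restricted
```

In the model the window is hit deterministically by passing `device.open` as a callback. On a real node the attacker has to win a timing race, which is why the attack relies on forcing many reschedules.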
Affected Systems
Based on the fixed releases listed under Mitigation Steps: NVIDIA Device Plugin for Kubernetes versions earlier than `v0.15.0`, and NVIDIA GPU Operator versions earlier than `v24.1.0`, which bundle the unpatched plugin.
Testing Guide
1. **Check Component Versions:** Use `kubectl describe node <node-name>` and look for the `nvidia.com/device-plugin-version` label to check the running version of the device plugin.
2. **Review Pod Security Contexts:** Audit your running pods to ensure they are not running as root or with hostPath mounts that could exacerbate a breakout.
3. **Attempt a PoC (in a controlled environment):** Security teams can use publicly released proof-of-concept code to test for the race condition on a non-production, isolated cluster node.
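Steps 1 and 2 lend themselves to scripting. The following is a minimal sketch, assuming the version string has already been read from the node label and that pod objects have the shape returned by `kubectl get pods -o json`. The helper names (`parse_version`, `is_patched`, `audit_pod`) and the sample pod are illustrative, and the root check is conservative: it flags any container that does not explicitly opt out of running as root.

```python
import re

PATCHED_PLUGIN = (0, 15, 0)  # first fixed Device Plugin release per this advisory


def parse_version(tag):
    """Parse 'v0.14.5' or '0.14.5' into (0, 14, 5); any suffix is ignored."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def is_patched(tag):
    # Note: a pre-release such as 'v0.15.0-rc.1' is treated as patched here.
    return parse_version(tag) >= PATCHED_PLUGIN


def audit_pod(pod):
    """Return a list of findings for one pod object (parsed JSON)."""
    findings = []
    spec = pod.get("spec", {})
    pod_sc = spec.get("securityContext") or {}
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"hostPath volume: {volume.get('name')}")
    for container in spec.get("containers", []):
        sc = container.get("securityContext") or {}
        name = container.get("name")
        # Container-level settings override pod-level ones.
        run_as_user = sc.get("runAsUser", pod_sc.get("runAsUser"))
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False))
        if run_as_user == 0 or (run_as_user is None and not non_root):
            findings.append(f"container {name} may run as root")
        if sc.get("privileged"):
            findings.append(f"container {name} is privileged")
    return findings


sample_pod = {  # hypothetical pod, for illustration only
    "spec": {
        "volumes": [{"name": "host-dev", "hostPath": {"path": "/dev"}}],
        "containers": [{"name": "trainer", "securityContext": {"runAsUser": 0}}],
    }
}
print(is_patched("v0.14.5"))  # False
for finding in audit_pod(sample_pod):
    print(finding)
```

To audit a live cluster, the same `audit_pod` function can be applied to each entry in the `items` array of `kubectl get pods --all-namespaces -o json`.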
Mitigation Steps
1. **Upgrade NVIDIA Components:** Immediately upgrade the NVIDIA Device Plugin to version `v0.15.0` or later, or upgrade the GPU Operator to `v24.1.0` or later, which includes the patched plugin.
2. **Use Hardened Runtimes:** Run GPU workloads using sandboxed container runtimes like gVisor or Kata Containers, which provide an additional layer of kernel isolation and can mitigate the impact of such vulnerabilities.
3. **Apply Principle of Least Privilege:** Do not run training or inference pods with root privileges or unnecessary capabilities. Use Pod Security Standards like `restricted`.
4. **Network Segmentation:** Implement strict network policies to prevent a compromised pod from moving laterally within the cluster.
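Steps 3 and 4 can be expressed as cluster manifests. The sketch below generates them as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes a hypothetical namespace named `gpu-workloads`; the `pod-security.kubernetes.io/enforce` label and the empty-`podSelector` default-deny pattern are standard Kubernetes features, but the policy shown is only a starting point and is not taken from NVIDIA's guidance.

```python
import json


def restricted_namespace(name):
    """Namespace that enforces the `restricted` Pod Security Standard."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                    # empty selector = all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed = deny both
        },
    }


NAMESPACE = "gpu-workloads"  # hypothetical name, adjust to your cluster
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [restricted_namespace(NAMESPACE), default_deny_policy(NAMESPACE)],
}
print(json.dumps(manifest, indent=2))
```

A default-deny egress rule also blocks DNS lookups, so workloads will typically need additional, narrowly scoped allow policies on top of it. Network policies only take effect when the cluster's network plugin enforces them.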
Patch Details
NVIDIA addressed the race condition in Device Plugin v0.15.0 by enforcing stricter, atomic permission settings during pod setup and teardown.
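The advisory does not describe the patch beyond "atomic permission settings", so the following is only a generic illustration of that idea, not NVIDIA's implementation. In this sketch (all names hypothetical), the permission change and every open go through the same lock, and the old rule is replaced directly by the new one, so a concurrent open can never observe an intermediate, unrestricted state.

```python
import threading


class GuardedDevice:
    """Toy device whose access rule is only ever swapped atomically."""

    def __init__(self, rule):
        self._lock = threading.Lock()
        self._rule = rule

    def open(self):
        # Opens serialize with rule changes, so they see either the old
        # rule or the new one, never a gap between them.
        with self._lock:
            return self._rule

    def reschedule(self, new_rule):
        with self._lock:
            # Single-step replacement: there is no "rule removed" state.
            self._rule = new_rule


def hammer(device, rounds, seen):
    """Repeatedly open the device and record every access level observed."""
    for _ in range(rounds):
        seen.add(device.open())


device = GuardedDevice("restricted:pod-0")
seen = set()
worker = threading.Thread(target=hammer, args=(device, 10_000, seen))
worker.start()
for i in range(1, 1_000):
    device.reschedule(f"restricted:pod-{i}")
worker.join()

print(all(level.startswith("restricted") for level in seen))  # True
```

The design point is that teardown and setup become one indivisible transition. However many times the pod is rescheduled, there is no moment at which the device is reachable without a restriction in force.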