NVIDIA GPU Driver Kernel Mode Layer Race Condition Allows Privilege Escalation and Container Escape
Overview
A high-severity race condition vulnerability was discovered in the NVIDIA GPU driver's kernel mode layer (`nvlddmkm.sys` on Windows, `nvidia.ko` on Linux). The flaw exists in the driver's handling of memory mapping operations submitted concurrently from user-mode applications. An attacker with local, low-privilege code execution can craft a sequence of specific IOCTL calls that trigger this race condition. Successful exploitation can lead to a write-what-where condition in kernel memory, allowing the attacker to overwrite critical kernel data structures. This can be leveraged to disable security mechanisms, corrupt memory, or ultimately execute arbitrary code with kernel-level (SYSTEM/root) privileges.

In the context of AI and ML workloads, this vulnerability is particularly dangerous. Many ML environments rely on containerization (e.g., Docker, Kubernetes) with GPU passthrough for performance. An attacker who has compromised a single, unprivileged container can exploit this driver vulnerability to escape the container's security boundaries and gain full control over the underlying host node. From there, they could compromise all other containers on the same host, steal sensitive models and data, or pivot to other parts of the network. The vulnerability bypasses standard container isolation mechanisms because the driver's kernel component is a shared resource accessible from within the container.
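To make the bug class concrete, the following is a toy model of a check-then-use race on a mapping table. It is not the driver's actual logic, which this advisory does not describe: `ToyMappingTable`, its methods, and the handle names are all hypothetical, and the "threads" are simulated by calling the steps in a fixed order so the outcome is deterministic. The sketch only shows why validating a handle in one critical section and writing through it in another can redirect a write to memory that now belongs to someone else.

```python
import threading
from typing import Dict, List, Optional


class ToyMappingTable:
    """Toy handle -> slot table. Hypothetical; NOT the real driver logic."""

    def __init__(self, slots: int = 2, slot_size: int = 8) -> None:
        self._lock = threading.Lock()
        self.slots: List[bytearray] = [bytearray(slot_size) for _ in range(slots)]
        self.owner: List[Optional[str]] = [None] * slots  # handle owning each slot
        self.handles: Dict[str, int] = {}                 # handle -> slot index

    def map(self, handle: str) -> int:
        with self._lock:
            index = self.owner.index(None)  # first free slot
            self.owner[index] = handle
            self.handles[handle] = index
            return index

    def unmap(self, handle: str) -> None:
        with self._lock:
            index = self.handles.pop(handle)
            self.owner[index] = None
            self.slots[index][:] = bytes(len(self.slots[index]))  # zero the slot

    # Racy pattern: the check and the use are two separate critical sections.
    def lookup(self, handle: str) -> int:
        with self._lock:
            return self.handles[handle]

    def write_unchecked(self, index: int, data: bytes) -> None:
        with self._lock:
            self.slots[index][: len(data)] = data

    # Safer pattern: validate ownership and write under one lock acquisition.
    def write_checked(self, handle: str, data: bytes) -> bool:
        with self._lock:
            index = self.handles.get(handle)
            if index is None or self.owner[index] != handle:
                return False
            self.slots[index][: len(data)] = data
            return True


def demo_stale_write() -> bytes:
    """Replay one bad interleaving and return the contents of B's slot."""
    table = ToyMappingTable()
    table.map("A")
    stale = table.lookup("A")               # "thread 1" validates handle A
    table.unmap("A")                        # "thread 2" unmaps A ...
    table.map("B")                          # ... and B reuses the freed slot
    table.write_unchecked(stale, b"AAAA")   # thread 1 writes via the stale index
    return bytes(table.slots[table.handles["B"]])
```

In this toy, `demo_stale_write()` returns a buffer owned by `B` that begins with the bytes written on behalf of `A`, while `write_checked("A", ...)` after the unmap refuses the write. In a real kernel the equivalent stale reference points at raw memory rather than a Python object, which is what turns the same pattern into a write-what-where primitive.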
Affected Systems
Testing Guide
1. **Check Driver Version**: On Windows, open the NVIDIA Control Panel and check the driver version under 'System Information'. On Linux, run `nvidia-smi` to display the installed driver version.
2. **Compare with Bulletin**: Cross-reference the installed version with the 'Affected Software' versions listed in the official NVIDIA security bulletin for the specific CVE.
3. **Use Vulnerability Scanner**: Run a host-based vulnerability scanner (e.g., Nessus, Qualys) with up-to-date plugins, which will automatically flag systems with outdated and vulnerable NVIDIA drivers.
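Steps 1 and 2 can be scripted across a fleet. The sketch below is one possible approach, not an official tool: it reads the driver version with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares it against a table of first patched versions per driver branch. That table is an assumption you must fill in from the NVIDIA security bulletin; this advisory does not list the versions, so none are hard-coded here. The function names are illustrative.

```python
import subprocess
from typing import Dict, List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted driver version string into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(installed: str, first_patched_by_branch: Dict[int, str]) -> Optional[bool]:
    """Compare an installed version with the first patched version of its branch.

    Returns True if older than the patched release, False if equal or newer,
    and None if the branch (first version component) is not in the table.
    """
    version = parse_version(installed)
    patched = first_patched_by_branch.get(version[0])
    if patched is None:
        return None  # unknown branch: consult the bulletin manually
    return version < parse_version(patched)


def installed_driver_versions() -> List[str]:
    """Query nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

A typical use is to call `installed_driver_versions()` on each host and pass every result to `is_vulnerable` together with the table built from the bulletin. A result of `None` means the branch is not covered by the table and should be checked by hand rather than assumed safe.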
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update the drivers on all host machines to the patched versions specified in the NVIDIA security bulletin.
2. **Isolate GPU Workloads**: When possible, run untrusted or multi-tenant GPU workloads on physically separate, dedicated hardware to limit the blast radius of a host compromise.
3. **Use gVisor or Kata Containers**: For an additional layer of security, use a sandboxed container runtime. gVisor intercepts system calls in a user-space kernel, and Kata Containers runs each workload inside a lightweight virtual machine; both reduce direct exposure of the host kernel and can potentially mitigate kernel exploit attempts.
4. **Restrict Privileged Container Operations**: In Kubernetes, use Pod Security Admission (the replacement for Pod Security Policies, which were removed in Kubernetes 1.25) or other admission controllers to prevent containers from using host features that could aid in exploitation, and avoid running containers in privileged mode.
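As a rough illustration of step 4, the following sketch audits a Kubernetes pod manifest, already loaded as a Python dictionary, for settings that weaken isolation. The function name and the choice of checks are assumptions made for this example; the fields inspected (`hostPID`, `hostIPC`, `hostNetwork`, `securityContext.privileged`, `securityContext.allowPrivilegeEscalation`) are standard pod spec fields. It does not replace admission control enforced by the cluster.

```python
from typing import Any, Dict, List


def risky_pod_settings(pod: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for isolation-weakening pod settings."""
    findings: List[str] = []
    spec = pod.get("spec") or {}

    # Host namespace sharing gives the container visibility into the node.
    for flag in ("hostPID", "hostIPC", "hostNetwork"):
        if spec.get(flag):
            findings.append(f"spec.{flag} is enabled")

    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        context = container.get("securityContext") or {}
        if context.get("privileged"):
            findings.append(f"container {name}: privileged mode is enabled")
        # Flag anything other than an explicit false, since the setting is
        # permissive when left unset.
        if context.get("allowPrivilegeEscalation") is not False:
            findings.append(
                f"container {name}: allowPrivilegeEscalation is not explicitly false"
            )
    return findings
```

An empty list means none of these particular settings were found; it does not mean the pod is safe from this vulnerability, since a container with GPU access can still reach the driver. The check is best run in CI against manifests before they are applied, alongside the cluster-side controls described above.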
Patch Details
Patched versions are available via NVIDIA's driver download page and are detailed in their official security bulletin for this vulnerability.