NVIDIA GPU Driver Kernel Mode Layer Privilege Escalation
Overview
A high-severity vulnerability was discovered in the kernel mode layer of the NVIDIA GPU display driver for Windows and Linux. The flaw lies in how the driver handles memory mapping and shader execution requests from user-mode applications. A maliciously crafted CUDA or graphics workload can trigger a race condition or buffer overflow within the `nvlddmkm.sys` (Windows) or `nvidia.ko` (Linux) driver component. Exploitation requires only low-privilege code execution on a system with an affected GPU, which is a common scenario in multi-tenant cloud ML platforms and shared research clusters. Successful exploitation lets the attacker execute arbitrary code with NT AUTHORITY\SYSTEM privileges on Windows or root privileges on Linux. Because containers share the host's kernel driver, this also lets the attacker escape containerized environments (e.g., Docker, Kubernetes pods with GPU passthrough) and gain full control over the underlying host machine. From there, the attacker can compromise all other tenants, access sensitive data, and persist within the infrastructure. The vulnerability underscores the importance of keeping GPU drivers, a key part of the AI infrastructure stack, fully patched.
Affected Systems
Testing Guide
1. **Check Driver Version (Windows):** Open the NVIDIA Control Panel and go to 'Help' -> 'System Information'. Compare the 'Driver version' with the patched version (e.g., 555.85 or newer).
2. **Check Driver Version (Linux):** Run the command `nvidia-smi` in the terminal. The driver version is shown in the header of the output, in the `Driver Version` field. Compare it with the patched version (e.g., 550.78 or newer).
3. **Interpret the Result:** If the installed driver version is lower than the patched version for that platform, treat the system as vulnerable.
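The version check in the steps above can be scripted for fleets with many GPU nodes. The following is a minimal sketch, not an official NVIDIA tool: the function names and the `PATCHED` table are hypothetical, and the thresholds are simply the example versions quoted in this document, so they should be confirmed against the NVIDIA Security Bulletin. It relies on the `nvidia-smi --query-gpu=driver_version --format=csv,noheader` query and compares version components numerically, because a plain string comparison would rank `550.100` below `550.78`.

```python
import subprocess
import sys

# Example thresholds quoted in this document; confirm them against the
# NVIDIA Security Bulletin before relying on the result.
PATCHED = {"windows": "555.85", "linux": "550.78"}


def parse_version(text):
    """Turn '550.54.14' into (550, 54, 14) so components compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(installed, platform):
    """Return True if `installed` is older than the patched version for `platform`."""
    return parse_version(installed) < parse_version(PATCHED[platform])


def installed_driver_versions():
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    target = "windows" if sys.platform.startswith("win") else "linux"
    for version in installed_driver_versions():
        status = "VULNERABLE" if is_vulnerable(version, target) else "patched"
        print(f"{version}: {status}")
```

This sketch uses a single threshold per platform. If the bulletin lists separate patched versions for different driver branches, the table would need one entry per branch.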
Mitigation Steps
1. **Update Drivers:** Immediately update NVIDIA drivers to the versions specified in the NVIDIA Security Bulletin.
2. **Principle of Least Privilege:** Do not run ML training jobs with root or administrator privileges inside containers.
3. **Use Sandboxed Containers:** For untrusted workloads, use stronger sandboxing technologies like Kata Containers or gVisor, which provide a hardware-virtualized or kernel-emulated boundary between the container and the host.
4. **Node Isolation:** In multi-tenant environments, use node taints and tolerations in Kubernetes to physically isolate workloads from different tenants onto different GPU-enabled nodes.
5. **Egress Filtering:** Implement strict network egress filtering to prevent a compromised host from communicating with an attacker's command-and-control server.
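As an illustration of how mitigations 2 through 4 might combine in a single Kubernetes Pod, the sketch below builds a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. Everything cluster-specific is an assumption made for the example: the `tenant` taint and label key, the UID `1000`, the image name, and the `kata` RuntimeClass name are all hypothetical and would need to match what is actually configured on the cluster. The `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed.

```python
import json


def build_gpu_pod(name, image, tenant, runtime_class=None):
    """Build a Pod manifest (as a dict) for one tenant's GPU job."""
    spec = {
        # Mitigation 2: refuse to start the container as root.
        "securityContext": {"runAsNonRoot": True, "runAsUser": 1000},
        # Mitigation 4: steer the pod to nodes labeled for this tenant
        # (hypothetical label key "tenant").
        "nodeSelector": {"tenant": tenant},
        # Mitigation 4: tolerate the matching taint tenant=<tenant>:NoSchedule
        # so that only this tenant's pods can land on those nodes.
        "tolerations": [
            {
                "key": "tenant",
                "operator": "Equal",
                "value": tenant,
                "effect": "NoSchedule",
            }
        ],
        "containers": [
            {
                "name": name,
                "image": image,
                "securityContext": {
                    "allowPrivilegeEscalation": False,
                    "privileged": False,
                },
                # Resource name exposed by the NVIDIA device plugin.
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    }
    if runtime_class:
        # Mitigation 3: opt in to a sandboxed runtime; the RuntimeClass name
        # is defined per cluster and is an assumption here.
        spec["runtimeClassName"] = runtime_class
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"tenant": tenant}},
        "spec": spec,
    }


if __name__ == "__main__":
    pod = build_gpu_pod("train-job", "registry.example.com/train:latest",
                        "team-a", runtime_class="kata")
    print(json.dumps(pod, indent=2))
```

The sketch pairs the toleration with a `nodeSelector` because a taint only keeps other pods off a node; it does not by itself force the tenant's pods onto that node. The nodes themselves would need the matching taint and label, for example via `kubectl taint nodes <node> tenant=team-a:NoSchedule` and `kubectl label nodes <node> tenant=team-a`.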
Patch Details
Patches are available in NVIDIA Game Ready Driver 555.85 and NVIDIA Studio Driver 555.85 for Windows, and Linux driver version 550.78.