NVIDIA CUDA Driver Kernel Mode Handler Vulnerability Allows Privilege Escalation in Multi-Tenant GPU Environments
Overview
A high-severity vulnerability has been discovered in the NVIDIA CUDA driver for Linux and Windows. The flaw resides in the kernel-mode driver component that handles memory-mapping requests from user-mode applications. Because the driver does not properly validate a parameter passed to an IOCTL (Input/Output Control) call, a crafted request can cause an out-of-bounds write in kernel memory. A local attacker with low privileges, such as a user running code inside a container with GPU access, can issue such a call to trigger the condition. Successful exploitation can cause a Denial of Service (DoS) by crashing the entire host system. More critically, an advanced exploit could use the memory corruption to execute arbitrary code with kernel-level privileges, resulting in a complete container escape and host compromise. The vulnerability is particularly dangerous in multi-tenant AI/ML environments, such as Kubernetes clusters with shared GPUs or cloud-based GPU instances, because it could let one tenant access other tenants' data or take over the underlying infrastructure.
Affected Systems
Testing Guide
1. Identify the installed NVIDIA driver version on your Linux system by running `nvidia-smi`.
2. Compare the displayed `Driver Version` with the patched versions listed in the official NVIDIA security bulletin for the specific CVE.
3. For Windows, check the driver version in the NVIDIA Control Panel or Device Manager.
4. If the installed version is older than the recommended patched version for your driver branch, the system is vulnerable and should be updated.
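The version comparison in the steps above can be automated. The following is a minimal Python sketch, not an official NVIDIA tool. It assumes `nvidia-smi` is on the `PATH`, and the helper names (`parse_version`, `is_vulnerable`, `installed_driver_versions`) and the `PATCHED_MIN` table are hypothetical. Only the R550 entry comes from this advisory; patched versions for other branches must be taken from the NVIDIA security bulletin.

```python
import subprocess

# Minimum patched version per driver branch, keyed by the major version.
# Only the R550 entry is taken from this advisory. Entries for other
# branches are intentionally absent: add them from the NVIDIA bulletin.
PATCHED_MIN = {550: (550, 90, 7)}


def parse_version(text):
    """Turn '550.90.07' into (550, 90, 7) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(installed, patched_min=PATCHED_MIN):
    """Return True or False for a listed branch, None for an unlisted one."""
    version = parse_version(installed)
    minimum = patched_min.get(version[0])
    if minimum is None:
        return None  # branch not in the table: consult the bulletin
    return version < minimum


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # One line per GPU; all GPUs on a host normally share one driver.
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})
```

On a GPU host, calling `is_vulnerable(v)` for each `v` returned by `installed_driver_versions()` gives a quick answer per host. A result of `None` means the branch is not in the table, so the version still has to be checked against the bulletin by hand.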
Mitigation Steps
1. **Update NVIDIA Drivers:** Update all host systems with NVIDIA GPUs to a patched driver version as specified in the NVIDIA security bulletin.
2. **Restrict GPU Access:** In multi-tenant environments, use mechanisms like Kubernetes device plugins and security contexts to limit which pods can access GPU resources.
3. **Use Sandboxed Containers:** Employ stronger container runtimes like gVisor or Kata Containers for untrusted GPU workloads to provide an additional layer of kernel isolation.
4. **Monitor System Logs:** Monitor kernel logs for signs of driver crashes or memory corruption errors, which could indicate attempted exploitation.
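As one way to approach the log-monitoring step, the sketch below filters kernel log lines for messages that commonly accompany driver faults or kernel memory corruption. The pattern list and the `find_suspicious` helper are assumptions chosen for illustration, not signatures published for this vulnerability. A match indicates only that a line deserves a closer look, and a clean log does not prove that no exploitation was attempted.

```python
import re

# Illustrative patterns only (an assumption, not an official signature set):
# NVIDIA driver Xid error reports plus generic kernel fault messages.
SUSPICIOUS_PATTERNS = [
    re.compile(r"NVRM: Xid"),
    re.compile(r"general protection fault"),
    re.compile(r"BUG: KASAN"),
    re.compile(r"kernel BUG at"),
    re.compile(r"\bOops\b"),
]


def find_suspicious(lines, patterns=SUSPICIOUS_PATTERNS):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in patterns):
            hits.append((number, line.rstrip("\n")))
    return hits
```

The function accepts any iterable of lines, so it can be fed the saved output of `dmesg` or `journalctl -k`, for example `find_suspicious(open("kernel.log"))`, where `kernel.log` is a hypothetical file name.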
Patch Details
The vulnerability is patched in the R550 branch (version 550.90.07 and later) and in other long-lived driver branches. Refer to the NVIDIA security bulletin for the full list of affected and patched versions.