Privilege Escalation via Out-of-Bounds Write in NVIDIA CUDA Driver
Overview
A high-severity vulnerability was discovered in the NVIDIA CUDA Driver kernel module, affecting systems that rely on GPUs for accelerated computing, including most AI training and inference workloads. The vulnerability is an out-of-bounds write in a specific IOCTL handler within the driver.

A local attacker with basic user privileges can send a crafted request to this handler, causing the kernel to write data outside the boundaries of the intended buffer. An attacker who controls this memory corruption can overwrite critical kernel data structures, such as function pointers or process privilege information. Successful exploitation allows the attacker to execute arbitrary code with kernel-level privileges, leading to a full system compromise.

The impact is especially severe in multi-tenant environments, such as university compute clusters or cloud-based GPU instances, where a single compromised user account could be used to take over the entire host machine and gain access to data from all other users and containers running on that node. This vulnerability underscores the importance of keeping low-level hardware drivers, which form a major part of the trusted computing base for AI infrastructure, fully patched.
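The mechanism described above can be illustrated with a toy model. The sketch below is not the driver's code and does not reflect its actual memory layout; the buffer size, the adjacent "uid" field, and all function names are hypothetical. It only shows why a write whose offset and length are not validated against the buffer size can spill into neighboring data, and how a bounds check prevents that.

```python
# Toy model of an out-of-bounds write. Everything here is hypothetical:
# a 16-byte buffer followed directly by a 4-byte "uid" field.
BUF_SIZE = 16
UID_OFFSET = BUF_SIZE  # sensitive field placed right after the buffer


def make_memory(uid: int = 1000) -> bytearray:
    """Return simulated memory: the buffer followed by a little-endian uid."""
    mem = bytearray(BUF_SIZE + 4)
    mem[UID_OFFSET:UID_OFFSET + 4] = uid.to_bytes(4, "little")
    return mem


def read_uid(mem: bytearray) -> int:
    """Read the simulated privilege field."""
    return int.from_bytes(mem[UID_OFFSET:UID_OFFSET + 4], "little")


def unchecked_write(mem: bytearray, offset: int, data: bytes) -> None:
    """Vulnerable pattern: trusts the caller-supplied offset and length."""
    mem[offset:offset + len(data)] = data


def checked_write(mem: bytearray, offset: int, data: bytes) -> None:
    """Fixed pattern: rejects any write that would leave the buffer."""
    if offset < 0 or offset + len(data) > BUF_SIZE:
        raise ValueError("write would exceed buffer bounds")
    mem[offset:offset + len(data)] = data
```

In this model, calling `unchecked_write(mem, 12, bytes(8))` covers bytes 12 through 19 and so zeroes the simulated uid, whereas `checked_write` with the same arguments raises `ValueError` and leaves the field intact.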
Affected Systems
Testing Guide
1. **Check Driver Version:** On Linux, run the command `nvidia-smi` and check the 'Driver Version' displayed in the header. Compare this version to the patched versions listed in the NVIDIA security bulletin.
2. **Review System Logs:** After a system crash or reboot, check kernel logs (`dmesg` or `journalctl -k`) for messages related to kernel panics, page faults, or memory corruption originating from the NVIDIA driver modules (e.g., `nvidia.ko`).
3. **Use Vulnerability Scanners:** Run a host-based vulnerability scanner that has checks for this specific CVE to confirm if your system is affected.
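The version check in step 1 can be scripted. The following is a minimal sketch that assumes a single minimum version (the Linux value given in this advisory) and a purely numeric dotted version string. The helper names are illustrative. The NVIDIA security bulletin may list separate patched versions for other driver branches, which this simple comparison does not account for.

```python
import subprocess
from typing import Tuple

# Minimum patched Linux driver version, as stated in this advisory.
MIN_PATCHED_LINUX = "535.154.05"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '535.154.05' into (535, 154, 5)."""
    return tuple(int(part) for part in version.strip().split("."))


def is_patched(installed: str, minimum: str = MIN_PATCHED_LINUX) -> bool:
    """Return True if the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version (requires the NVIDIA tools)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # One line is printed per GPU; all GPUs share the same driver.
    return result.stdout.splitlines()[0].strip()
```

On a host with the NVIDIA tools installed, `is_patched(installed_driver_version())` gives a quick first answer, which should still be confirmed against the bulletin.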
Mitigation Steps
1. **Update NVIDIA Drivers:** Immediately update to a patched driver version as specified in the NVIDIA security bulletin. For Linux, this is version 535.154.05 or later.
2. **Restrict GPU Access:** In multi-tenant environments, use mechanisms like Kubernetes device plugins and security contexts to limit which users and pods can access GPU devices.
3. **Use Secure Base Images:** Ensure that container images for ML workloads are built on a minimal, hardened base OS and that the host OS and kernel are regularly patched.
4. **Monitor Kernel Logs:** Monitor for anomalous kernel-level activity and crashes, which could indicate exploitation attempts.
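The log monitoring in step 4 can start from a simple filter over kernel log text. The sketch below is a heuristic and not a detection signature for this vulnerability: the fault and driver patterns are illustrative assumptions, and the function name and window size are hypothetical. It flags fault-like lines that have a mention of the NVIDIA driver within a few lines on either side.

```python
import re
from typing import Iterable, List

# Illustrative patterns only (assumptions, not an official signature list).
FAULT_RE = re.compile(
    r"kernel panic|general protection fault|unable to handle|BUG:|Oops",
    re.IGNORECASE,
)
DRIVER_RE = re.compile(r"\bnvidia|NVRM", re.IGNORECASE)


def suspicious_lines(log_lines: Iterable[str], window: int = 5) -> List[str]:
    """Return fault lines that mention, or sit near, the NVIDIA driver."""
    lines = list(log_lines)
    hits = []
    for i, line in enumerate(lines):
        if not FAULT_RE.search(line):
            continue
        # Look at the fault line plus `window` lines before and after it.
        start = max(0, i - window)
        nearby = lines[start:i + window + 1]
        if any(DRIVER_RE.search(entry) for entry in nearby):
            hits.append(line.rstrip("\n"))
    return hits
```

The input can be the captured output of `dmesg` or `journalctl -k`, split into lines. Any hit is a prompt for manual review, since driver crashes have many causes unrelated to exploitation.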
Patch Details
Patched in NVIDIA GPU Display Driver versions 535.154.05 and later (Linux) and 538.33 and later (Windows).