NVIDIA GPU Driver Out-of-Bounds Write Leading to Denial of Service or Privilege Escalation
Overview
A high-severity vulnerability was discovered in the kernel mode layer of the NVIDIA GPU display driver (`nvlddmkm.sys` on Windows, `nvidia.ko` on Linux). A local user with basic privileges can cause an out-of-bounds write by supplying specially crafted shader data to the driver. Because an unprivileged user-mode application can trigger the condition, the result is a system crash (Denial of Service) or, in some scenarios, arbitrary code execution with kernel-level privileges.
In AI and ML workloads, this poses a significant threat to multi-tenant environments. An attacker who is allowed to run a training job on a shared GPU instance could embed a crafted shader in a seemingly harmless model or data processing script. When the workload runs, it could crash the entire host node and disrupt every other GPU-based workload on it, or potentially escape its container and gain control of the host operating system. From there, the attacker could access data from other tenants' workloads and move laterally within the cloud or data center environment.
Affected Systems
Testing Guide
1. **Check Driver Version**: On Linux, run `nvidia-smi` to get the installed driver version. On Windows, check the NVIDIA Control Panel.
2. **Compare with Patched Versions**: Compare your installed version with the patched versions listed in the NVIDIA Security Bulletin. If your version is lower, you are affected.
3. **Active Exploitation (Not Recommended)**: Probing for this vulnerability requires specialized code and can crash the target system. It is not recommended to test for exploitability outside of a controlled security research environment. Version checking is the safest method.
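Steps 1 and 2 can be scripted for a fleet of machines. The following is a minimal sketch, not an official NVIDIA tool. The helper names (`parse_version`, `is_patched`, `installed_driver_version`) are hypothetical, and the minimum versions are copied from the Patch Details section. It assumes that a plain numeric comparison is meaningful for your driver branch; NVIDIA maintains several driver branches in parallel, so the NVIDIA Security Bulletin remains the authoritative source.

```python
import subprocess

# Minimum patched versions, copied from the Patch Details section.
PATCHED = {"windows": "538.15", "linux": "535.129.03"}


def parse_version(text):
    """Turn '535.129.03' into (535, 129, 3) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(installed, platform):
    """True if `installed` is at or above the patched version for `platform`.

    Assumption: simple numeric ordering; other driver branches may have
    their own patched releases that this check does not know about.
    """
    return parse_version(installed) >= parse_version(PATCHED[platform])


def installed_driver_version():
    """Ask nvidia-smi for the driver version (one line is printed per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a Linux host with the driver installed, `is_patched(installed_driver_version(), "linux")` would report whether the node meets the minimum listed version. The comparison is kept separate from the `nvidia-smi` call so it can also be run against an inventory of version strings collected elsewhere.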
Mitigation Steps
1. **Update Drivers Immediately**: Update all NVIDIA drivers on workstations and servers to the latest version provided by NVIDIA or the cloud service provider.
2. **Isolate GPU Workloads**: Use strong virtualization and containerization technologies (e.g., gVisor, Kata Containers) to limit the impact of a potential container escape.
3. **Restrict User Access**: Limit direct access to GPU nodes to trusted users. Implement strict access control and auditing for submitting ML jobs.
4. **Monitor System Logs**: Monitor kernel logs for signs of driver crashes or instability, which could indicate attempts to exploit this vulnerability.
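For step 4 on Linux, one possible starting point is to scan kernel log output for NVIDIA driver error reports. The sketch below assumes that the driver tags its kernel messages with `NVRM:` and reports GPU errors as `Xid (<PCI address>): <code>`; verify that pattern against your own logs before relying on it. The function names are hypothetical. A match indicates driver instability worth investigating, not proof of an exploitation attempt.

```python
import re

# Assumed message shape: "NVRM: Xid (<PCI address>): <numeric code>, ..."
XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")


def find_driver_events(lines):
    """Return (pci_address, xid_code) for each line that reports an Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def scan_log(path):
    """Scan a saved kernel log, e.g. the output of `dmesg` or `journalctl -k`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_driver_events(handle)
```

Feeding the returned events into an existing alerting pipeline, grouped by node and PCI address, makes repeated driver faults on the same shared GPU easier to spot.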
Patch Details
The fix is included in NVIDIA driver version 538.15 and later on Windows, and in version 535.129.03 and later on Linux.