NVIDIA GPU Display Driver Improper Input Validation Leading to Privilege Escalation
Overview
A high-severity vulnerability was found in the kernel-mode layer of NVIDIA's GPU display drivers for Windows. The driver component that processes input from user-mode applications did not adequately validate that input. A local attacker with basic user privileges could therefore send a crafted request to the driver, and processing the malformed input could cause out-of-bounds memory access in kernel mode.

The impact ranges from denial of service (DoS) through a system crash (Blue Screen of Death) to, more critically, arbitrary code execution with the highest system privileges (NT AUTHORITY\SYSTEM).

In AI and ML environments this is especially dangerous on multi-tenant systems such as shared development servers, JupyterHub instances, or Kubernetes clusters with GPU sharing. A malicious user or a compromised low-privilege application could exploit the flaw to escape its container or user session and take complete control of the host node, compromising every other workload, dataset, and model on that machine. The vulnerability underlines the importance of keeping low-level hardware drivers, a key part of the AI infrastructure stack, patched and up to date.
Affected Systems
Testing Guide
1. Identify the NVIDIA driver version currently installed on your Windows system. It is shown in the NVIDIA Control Panel under 'System Information' and in the output of `nvidia-smi` run from a command prompt.
2. Compare the installed version against the patched version listed for your driver branch in the NVIDIA Security Bulletin (e.g., 551.86 for the GeForce Game Ready Driver). Compare the version components numerically and only against the patched version for the same branch.
3. If the installed version is lower than the patched version for its branch, the system is vulnerable and should be updated immediately.
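The check in steps 1 to 3 can be sketched in Python. This is a minimal illustration, not an official NVIDIA tool: it assumes `nvidia-smi` is on the `PATH`, and it uses 551.86 (the GeForce Game Ready Driver figure quoted above) as the only patched version. The function names `parse_version`, `is_outdated`, and `installed_driver_versions` are made up for this example. For other driver branches, substitute the patched version from the security bulletin.

```python
import subprocess

# Patched version quoted in this advisory for the GeForce Game Ready Driver.
# Other branches have their own patched versions (see the security bulletin).
PATCHED_VERSION = "551.86"


def parse_version(text):
    """Turn '551.86' into (551, 86) so the comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.strip().split("."))


def is_outdated(installed, patched=PATCHED_VERSION):
    """Return True if the installed version is lower than the patched version."""
    return parse_version(installed) < parse_version(patched)


def installed_driver_versions():
    """Query nvidia-smi for the driver version (one output line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        versions = installed_driver_versions()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
    else:
        # All GPUs on one host normally share a single driver version.
        for version in sorted(set(versions)):
            if is_outdated(version):
                print(f"{version}: below {PATCHED_VERSION}, update required")
            else:
                print(f"{version}: at or above {PATCHED_VERSION}")
```

Parsing the version into a tuple of integers avoids the pitfalls of string comparison, where "99.10" would sort after "551.86".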
Mitigation Steps
1. **Update Drivers:** Download and install the latest NVIDIA GPU driver version from the official NVIDIA website or through your cloud provider's update mechanism.
2. **Implement Principle of Least Privilege:** Run GPU-accelerated workloads as non-privileged users and within sandboxed environments (e.g., containers) whenever possible to limit direct access to the kernel driver. Because the flaw is exploitable with basic user privileges, this reduces exposure but does not replace patching.
3. **Use Sandboxing Technologies:** For multi-tenant Kubernetes clusters, consider technologies such as gVisor or Kata Containers to provide stronger isolation between pods and the host kernel.
4. **Regular Vulnerability Scanning:** Use host-level vulnerability scanners to detect outdated and vulnerable drivers on systems with GPUs.
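Where no dedicated scanner is available, the scanning step could be approximated with a small inventory check like the Python sketch below. It is an illustration under stated assumptions: the inventory format, the host names, and the installed versions are hypothetical sample data, and the only patched version included is the 551.86 figure this advisory quotes for the GeForce Game Ready Driver. Hosts on any other driver line are reported as "unknown" rather than guessed at, so that they are checked against the security bulletin by hand.

```python
# Patched versions keyed by driver product line. Only this entry comes from
# the advisory; add further lines from the NVIDIA security bulletin as needed.
PATCHED_BY_PRODUCT = {"GeForce Game Ready Driver": "551.86"}


def parse_version(text):
    """Turn '551.86' into (551, 86) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def classify(product, installed, patched_by_product=PATCHED_BY_PRODUCT):
    """Return 'outdated', 'ok', or 'unknown' for one host's driver."""
    patched = patched_by_product.get(product)
    if patched is None:
        # No patched version on record for this product line: do not guess.
        return "unknown"
    if parse_version(installed) < parse_version(patched):
        return "outdated"
    return "ok"


def scan(inventory, patched_by_product=PATCHED_BY_PRODUCT):
    """Map each host name to its classification."""
    return {
        host: classify(product, version, patched_by_product)
        for host, (product, version) in inventory.items()
    }


if __name__ == "__main__":
    # Hypothetical inventory: host -> (driver product line, installed version).
    inventory = {
        "build-01": ("GeForce Game Ready Driver", "546.33"),
        "ml-02": ("GeForce Game Ready Driver", "552.22"),
        "viz-03": ("Example Other Driver Line", "538.15"),
    }
    for host, status in sorted(scan(inventory).items()):
        print(f"{host}: {status}")
```

Returning an explicit "unknown" status keeps the report honest: a host is only marked "ok" when its version has actually been compared against a known patched release.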
Patch Details
The vulnerability was fixed in NVIDIA driver version 551.86 and corresponding versions for other driver branches, as detailed in the March 2024 security bulletin.