NVIDIA GPU Driver Kernel Mode Layer Contains Out-of-Bounds Write Vulnerability Leading to Privilege Escalation
Overview
NVIDIA released a security bulletin detailing a high-severity vulnerability in its Linux GPU driver. The vulnerability, tracked as CVE-2024-0073, is an out-of-bounds write flaw within the driver's kernel mode layer. An attacker who has local user access to a system with an affected driver can craft a specific sequence of API calls to the GPU, triggering the out-of-bounds write. This can lead to denial of service (crashing the entire system) or, more critically, arbitrary code execution with kernel-level privileges.
The primary impact is on multi-tenant AI infrastructure, such as Kubernetes clusters using GPU sharing or time-slicing. A user with access to a low-privilege container with a GPU slice could exploit this vulnerability to escape the container and gain root access on the underlying host node. From there, the attacker could compromise all other containers on the node, accessing sensitive training data, stealing proprietary models, or disrupting critical AI inference services.
The discovery of this vulnerability emphasizes the importance of maintaining a secure infrastructure stack, including hardware drivers, especially in shared GPU environments, which are becoming standard for cost-effective MLOps.
Affected Systems
Testing Guide
1. **Check Driver Version:** On a Linux system with an NVIDIA GPU, run `nvidia-smi` to display the installed driver version.
2. **Compare with Patched Versions:** Cross-reference the installed version with the patched versions listed in the NVIDIA security bulletin (e.g., 550.54.14, 535.154.05). A driver older than the patched release on its own branch should be treated as affected.
3. **Scan for Vulnerabilities:** Use a system vulnerability scanner (e.g., Trivy, Qualys) that has been updated with the latest CVE definitions to automatically detect whether your installed driver version is vulnerable.
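The version check in the steps above can be scripted for a fleet of nodes. The following is a minimal sketch, not an official NVIDIA tool. It assumes the only patched releases that matter are the three named in this advisory, and that a driver on any other branch cannot be classified without consulting the bulletin. The function names (`parse_version`, `check_driver`, `installed_driver_versions`) and the result labels are illustrative.

```python
import subprocess

# Patched release per driver branch, taken from the versions named in this
# advisory. Assumption: other branches must be checked against the bulletin.
PATCHED = {
    550: (550, 54, 14),
    545: (545, 29, 6),
    535: (535, 154, 5),
}


def parse_version(text):
    """Turn a version string such as '535.129.03' into (535, 129, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def check_driver(version_text):
    """Return 'patched', 'outdated', or 'unknown' for a driver version string."""
    version = parse_version(version_text)
    patched = PATCHED.get(version[0])
    if patched is None:
        return "unknown"  # branch not listed in this advisory
    # Tuples compare component by component, so 535.129.03 < 535.154.05.
    return "patched" if version >= patched else "outdated"


def installed_driver_versions():
    """Query nvidia-smi for the driver version (it prints one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})
```

On a GPU node, calling `check_driver(v)` for each `v` in `installed_driver_versions()` gives a per-host result. Any `unknown` result should be resolved manually against the bulletin rather than assumed safe.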
Mitigation Steps
1. **Update NVIDIA Drivers:** Immediately update all affected systems to the patched driver versions listed in the NVIDIA security bulletin.
2. **Restrict GPU Access:** In multi-tenant environments, use mechanisms like Kubernetes admission controllers or Linux cgroups to limit which pods or users can access GPU devices.
3. **Use Secure Runtimes:** Employ container runtimes with stronger isolation, such as gVisor or Kata Containers, for untrusted GPU workloads, although this may have performance implications.
4. **Monitor Kernel Logs:** Monitor for unexpected kernel panics or driver errors, which could be an indicator of exploitation attempts.
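The kernel log monitoring step can be approximated with a simple filter. This is a rough sketch: the patterns are an assumed, non-exhaustive set of strings that commonly accompany NVIDIA driver faults (`NVRM: Xid`) or kernel crashes. A match indicates a fault worth investigating, not proof of exploitation. The function name `flag_kernel_log_lines` is illustrative.

```python
import re

# Assumed, non-exhaustive patterns for driver faults and kernel crashes.
SUSPICIOUS_PATTERNS = [
    re.compile(r"NVRM: Xid"),                 # NVIDIA driver error report
    re.compile(r"Kernel panic"),              # fatal kernel error
    re.compile(r"general protection fault"),  # invalid memory access
    re.compile(r"BUG: unable to handle"),     # kernel page-fault oops
]


def flag_kernel_log_lines(lines):
    """Return the log lines that match any suspicious pattern, newline stripped."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS)
    ]
```

The input can be any iterable of log lines, for example the captured output of `dmesg` or `journalctl -k` split into lines. In production, forwarding these matches to an existing alerting pipeline is likely more useful than running the filter by hand.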
Patch Details
Patched versions are available from NVIDIA, including 550.54.14, 545.29.06, and 535.154.05, depending on the driver branch.