NVIDIA GPU Driver Improper Input Validation Leading to Privilege Escalation on ML Hosts
Overview
A high-severity privilege escalation vulnerability was identified in the NVIDIA GPU Display Driver for Windows. The flaw resides in the kernel mode layer (`nvlddmkm.sys`), which fails to properly validate input from a user-mode process. A local attacker with basic user privileges can craft a malicious request that causes a write to an arbitrary memory offset in the kernel address space. By controlling the written data and the offset, the attacker can corrupt critical kernel data structures, leading to denial of service (system crash) or, more significantly, arbitrary code execution with SYSTEM-level privileges.

This vulnerability is particularly dangerous for AI and machine learning infrastructure. An attacker who gains initial access to a containerized ML workload (e.g., a Jupyter notebook or a training job) running on a shared GPU host could exploit it to escape the container, compromise the underlying host operating system, and gain full control over the node. From there, the attacker could access or disrupt every other GPU workload on the machine, steal sensitive model data, and potentially pivot to other parts of the network.
Affected Systems
Testing Guide
1. **Check Driver Version (Windows)**: Open the NVIDIA Control Panel and go to `Help > System Information`. Verify that the 'Driver version' is 536.23 or higher.
2. **Check Driver Version (CLI)**: Open a command prompt and run `nvidia-smi`. The driver version is displayed in the header of the output, labeled `Driver Version`. Ensure it is 536.23 or higher.
3. **Review Security Scans**: Check reports from vulnerability management tools for alerts related to CVE-2023-31024.
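The CLI check can be automated. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and that driver versions are plain dotted numbers such as `536.23`. The function names (`parse_version`, `is_patched`, `installed_driver_versions`) are illustrative and not part of any NVIDIA tooling; the threshold comes from the patched version stated in this advisory.

```python
import subprocess

# Minimum patched version, taken from this advisory.
PATCHED_VERSION = (536, 23)


def parse_version(text):
    """Turn a driver version string such as '536.23' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(version_text, minimum=PATCHED_VERSION):
    """Return True if the version is at or above the patched release."""
    return parse_version(version_text) >= minimum


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU.

    Assumes nvidia-smi is installed and on the PATH; raises
    CalledProcessError if the command fails.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a host with a GPU, `all(is_patched(v) for v in installed_driver_versions())` gives a single pass/fail answer. Comparing integer tuples rather than raw strings avoids ordering mistakes when the components have different digit counts.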
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update all affected systems to NVIDIA GPU Display Driver version 536.23 or newer by downloading from the official NVIDIA website.
2. **Principle of Least Privilege**: Ensure that processes and containers accessing GPU resources run with the minimum necessary privileges.
3. **Container Hardening**: Use container runtimes with strong isolation guarantees, such as gVisor or Kata Containers, to limit the kernel attack surface accessible from within a container.
4. **Regular Host Scanning**: Implement regular vulnerability scanning of host systems, including drivers, to detect outdated and vulnerable components.
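For the host scanning step, a simple triage over an existing inventory can flag nodes that still need the update. This is a sketch only: the CSV layout (`hostname`, `driver_version` columns), the host names, and the function name `outdated_hosts` are hypothetical, and a real deployment would pull this data from its own asset or vulnerability management system.

```python
import csv
import io

# Minimum patched version, taken from this advisory.
PATCHED_VERSION = (536, 23)


def outdated_hosts(inventory_csv, minimum=PATCHED_VERSION):
    """Return hostnames whose driver version is below the patched release.

    Expects CSV text with 'hostname' and 'driver_version' columns
    (a hypothetical inventory format used for illustration).
    """
    flagged = []
    for row in csv.DictReader(io.StringIO(inventory_csv)):
        version = tuple(int(p) for p in row["driver_version"].strip().split("."))
        if version < minimum:
            flagged.append(row["hostname"])
    return flagged


# Made-up sample inventory; the host names and versions are placeholders.
SAMPLE_INVENTORY = """hostname,driver_version
ml-node-01,531.14
ml-node-02,536.23
ml-node-03,535.98
"""
```

Calling `outdated_hosts(SAMPLE_INVENTORY)` on the sample data returns the two hosts running versions below 536.23, which would then be prioritized for the driver update.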
Patch Details
The vulnerability is addressed in NVIDIA GPU Display Driver version 536.23 and all subsequent releases.