NVIDIA GPU Driver Use-After-Free Vulnerability Allows for Privilege Escalation in ML Workloads
Overview
A high-severity use-after-free vulnerability was identified in the NVIDIA kernel mode driver for Linux. It can be triggered by a local user with basic access to the GPU device, a common scenario on multi-tenant AI/ML development servers and in Kubernetes clusters with GPU sharing.

The flaw exists in the driver's memory management component responsible for handling CUDA-related operations. A specially crafted sequence of API calls can create a race condition in which a kernel object is freed while a reference to it is still held. A subsequent operation can then use this dangling pointer to write attacker-controlled data into memory that has since been reallocated for other kernel purposes.

An attacker can exploit this to corrupt critical kernel data structures, leading to a system crash (Denial of Service) or, more significantly, to arbitrary code execution with kernel-level privileges. This allows the attacker to break out of their user-level process or, in a containerized environment, to escape the container's isolation and gain full control over the host node.

This is particularly dangerous in cloud and on-premise AI clusters where multiple users or workloads share physical GPUs, because a compromise of one workload could lead to the compromise of the entire host and every other workload running on it. The vulnerability underscores the importance of keeping GPU drivers, a key part of the AI infrastructure stack, patched and up to date.
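The free, reallocate, and stale-write sequence described above can be illustrated with a toy model. The sketch below is not driver code and does not reproduce the actual bug: `SlotPool`, `demonstrate_stale_handle`, and the dictionary objects are hypothetical stand-ins for a kernel allocator and kernel objects, chosen only to show why a reference that outlives a free is dangerous.

```python
# Toy model only: a fixed-size pool stands in for a kernel object cache.
# All names here are hypothetical and unrelated to the real driver.
class SlotPool:
    def __init__(self, size):
        self.slots = [None] * size
        self.free_list = list(range(size))

    def alloc(self, obj):
        # Reuse the most recently freed slot first (LIFO), which many
        # allocators do for cache locality.
        index = self.free_list.pop()
        self.slots[index] = obj
        return index

    def free(self, index):
        self.slots[index] = None
        self.free_list.append(index)


def demonstrate_stale_handle():
    pool = SlotPool(4)

    # 1. An object is allocated and a second reference to it is kept.
    buffer_handle = pool.alloc({"kind": "gpu_buffer", "length": 16})
    stale_handle = buffer_handle

    # 2. The object is freed, but stale_handle is never invalidated.
    pool.free(buffer_handle)

    # 3. The slot is reallocated for an unrelated, sensitive object.
    creds_handle = pool.alloc({"kind": "credentials", "uid": 1000})

    # 4. A write through the stale handle now lands in the new object.
    pool.slots[stale_handle]["uid"] = 0

    return stale_handle, creds_handle, pool.slots[creds_handle]
```

In this model the stale handle and the new handle refer to the same slot, so the final write silently changes the unrelated object. A real exploit additionally has to win the race and control what gets reallocated into the freed memory, which the toy model skips.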
Affected Systems
Testing Guide
1. On a Linux system with an NVIDIA GPU, identify the installed driver version by running the `nvidia-smi` command.
2. Compare the displayed `Driver Version` with the patched versions listed in the NVIDIA security bulletin for CVE-2024-0071.
3. If the installed version is lower than the specified patched version for its release branch (e.g., you have `545.29.05` when `545.29.06` is the fix), the system is vulnerable.
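The comparison in the steps above can be scripted for fleets of hosts. The following is a minimal sketch, assuming the only known fixed versions are the two named in this advisory (`550.40.07` and `545.29.06`); the `PATCHED_BY_BRANCH` table and the function names are illustrative and should be extended from the official bulletin. It queries the driver version with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import subprocess

# Assumption: only the two fixed versions named in this advisory are listed.
# Extend this table from the official NVIDIA bulletin for other branches.
PATCHED_BY_BRANCH = {"550": "550.40.07", "545": "545.29.06"}


def parse_version(text):
    """Turn '545.29.05' into (545, 29, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_driver(installed, patched_by_branch=PATCHED_BY_BRANCH):
    """Return 'vulnerable', 'patched', or 'unknown' for a driver version."""
    branch = installed.strip().split(".")[0]
    fixed = patched_by_branch.get(branch)
    if fixed is None:
        # Branch not in the table: consult the bulletin rather than guess.
        return "unknown"
    if parse_version(installed) < parse_version(fixed):
        return "vulnerable"
    return "patched"


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    lines = (line.strip() for line in result.stdout.splitlines())
    return sorted({line for line in lines if line})
```

On a GPU host, `[(v, classify_driver(v)) for v in installed_driver_versions()]` gives one verdict per distinct driver version. A result of `unknown` means the release branch is missing from the table, not that the system is safe.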
Mitigation Steps
1. **Update Drivers**: Immediately update all affected NVIDIA drivers to a patched version as specified in the official NVIDIA Security Bulletin.
2. **Restrict GPU Access**: On multi-tenant systems, use mechanisms like Kubernetes device plugins and security policies (e.g., Pod Security Admission or Kyverno; the older Pod Security Policies were removed in Kubernetes 1.25) to limit GPU access to trusted workloads only.
3. **Use Hardened Kernels**: Employ kernel hardening features like Kernel Page-Table Isolation (KPTI) and Control-Flow Integrity (CFI) to make exploitation more difficult.
4. **Monitor for Anomalies**: Monitor GPU and system-level logs for anomalous behavior or driver crashes that could indicate an exploitation attempt.
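For the monitoring step, one practical signal is the `NVRM: Xid` messages that the NVIDIA driver writes to the kernel log when it detects a GPU error. The sketch below assumes the common `NVRM: Xid (PCI:...): <code>, ...` line layout; `find_xid_events` is a hypothetical helper name. Xid events indicate GPU faults in general and are not specific to this vulnerability, so treat a match as a prompt for investigation rather than as proof of exploitation.

```python
import re

# Assumed layout of a driver error line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, name=python, Ch 00000008
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(log_lines):
    """Return (device, xid_code) tuples for every Xid line in the input."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events
```

The input can be any iterable of log lines, such as the output of `dmesg` or `journalctl -k` split on newlines. Unexpected spikes in events on a shared node, especially together with driver crashes, are worth correlating with the workloads scheduled there at the time.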
Patch Details
Patched in NVIDIA driver versions R550 (550.40.07), R545 (545.29.06), and others. Refer to the official NVIDIA Security Bulletin 5521 for a full list of affected and fixed versions.