NVIDIA CUDA Driver Use-After-Free Vulnerability Enables GPU Memory Hijacking and Host Escape
Overview
A critical use-after-free vulnerability was discovered in the memory management component of the NVIDIA CUDA driver for Linux and Windows. A user-mode process, such as a containerized ML training job, can trigger the flaw by issuing a specific sequence of API calls that allocate, map, and deallocate GPU memory. Because of improper reference counting, the driver can release a memory object while a pointer to it is still retained. An attacker's process can then reallocate that memory region and so control what the dangling pointer refers to, which lets the attacker read from or write to arbitrary GPU memory addresses. In a multi-tenant GPU environment (e.g., one using Multi-Instance GPU, MIG), an attacker in one container can access the memory of another tenant's model and steal proprietary data or weights. More severely, by corrupting shared memory structures used by the host kernel, the attacker could escalate the vulnerability to arbitrary code execution on the host operating system, resulting in a full container escape and compromise of the underlying node.
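The reference-counting mistake described above can be illustrated with a toy model. This is only a sketch of the general use-after-free pattern, not NVIDIA driver code: the `ToyAllocator` class, its methods, and the `take_reference` switch are all hypothetical names invented for illustration. The sketch shows how a mapping that fails to take its own reference ends up aliasing memory that has since been handed to someone else.

```python
class ToyAllocator:
    """Toy refcounted buffer pool (hypothetical; not NVIDIA driver code)."""

    def __init__(self):
        self.slots = {}      # slot id -> bytearray contents
        self.refcount = {}   # slot id -> number of holders
        self.free_ids = []   # released slot ids, reused before new ones
        self.next_id = 0

    def alloc(self, size):
        if self.free_ids:
            slot = self.free_ids.pop()
        else:
            slot = self.next_id
            self.next_id += 1
        self.slots[slot] = bytearray(size)
        self.refcount[slot] = 1
        return slot

    def map(self, slot, take_reference):
        # A correct implementation always counts the mapping as a holder.
        # Passing take_reference=False models the reference-counting bug.
        if take_reference:
            self.refcount[slot] += 1
        return slot  # the "mapping" is just a retained slot id

    def free(self, slot):
        self.refcount[slot] -= 1
        if self.refcount[slot] == 0:
            del self.slots[slot]
            self.free_ids.append(slot)


def stale_mapping_sees_new_owner(take_reference):
    """Return True if a retained mapping aliases a later allocation."""
    pool = ToyAllocator()
    first = pool.alloc(8)
    mapping = pool.map(first, take_reference)
    pool.free(first)                   # owner drops its reference
    other = pool.alloc(8)              # a different tenant allocates
    pool.slots[other][:6] = b"secret"
    # With the bug, the freed slot is recycled and the stale mapping
    # now points at the other tenant's data.
    return mapping == other and bytes(pool.slots[mapping][:6]) == b"secret"
```

In this model, `stale_mapping_sees_new_owner(False)` reproduces the dangling reference, while `stale_mapping_sees_new_owner(True)` keeps the original slot alive so the second allocation gets separate storage.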
Affected Systems
Testing Guide
1. Identify the installed NVIDIA driver version on your system. On Linux, run `nvidia-smi`.
2. Compare the installed version against the patched versions listed in the NVIDIA security bulletin (e.g., 550.90.07 for the 550 series).
3. If your version is lower than the patched version for its respective branch, the system is vulnerable.
4. A specific proof-of-concept exploit binary can be run within a container to confirm exploitability; however, this can crash the system and should only be done in non-production environments.
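The version comparison in the steps above can be scripted. The following is a minimal sketch, assuming the driver branch is the first component of the version string and that `nvidia-smi` is on the `PATH`. The `PATCHED_BY_BRANCH` table contains only the single example value quoted in this guide; it is a placeholder to be filled in from the NVIDIA security bulletin, and the function names are illustrative.

```python
import subprocess

# Placeholder table: only the example from this guide. Fill in the
# remaining branches from the NVIDIA security bulletin.
PATCHED_BY_BRANCH = {"550": "550.90.07"}


def parse_version(text):
    """Turn '550.90.07' into (550, 90, 7) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(installed, patched_by_branch=PATCHED_BY_BRANCH):
    """Return True or False, or None if the branch is not in the table."""
    branch = installed.strip().split(".")[0]
    patched = patched_by_branch.get(branch)
    if patched is None:
        return None  # unknown branch: consult the bulletin manually
    return parse_version(installed) < parse_version(patched)


def installed_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a GPU host, `is_vulnerable(installed_driver_version())` gives the verdict for the running driver. A `None` result means the branch is missing from the table, not that the system is safe.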
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update all host systems to the patched driver versions specified in the NVIDIA security bulletin.
2. **Use GPU-Aware Sandboxing**: For untrusted workloads, use stronger isolation mechanisms such as Kata Containers, which runs each workload under a separate guest kernel, or gVisor, which intercepts system calls in a user-space kernel. Both can mitigate kernel-level exploits by reducing the workload's direct exposure to the host kernel.
3. **Restrict Privileged Access**: Avoid running ML workloads with unnecessary privileges. Do not grant containers `CAP_SYS_ADMIN` or access to the host device filesystem.
4. **Monitor GPU Activity**: Use monitoring tools to detect anomalous GPU memory usage patterns that could indicate an exploitation attempt.
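As one way to audit the privilege restriction above, a workload can check whether it holds `CAP_SYS_ADMIN` by reading the `CapEff` mask that Linux exposes in `/proc/<pid>/status`. This is a minimal sketch for Linux only; the function names are illustrative, and it inspects only the effective capability set, not device mounts or other privileges.

```python
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in linux/capability.h


def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap_sys_admin(status_text):
    """Return True if the CAP_SYS_ADMIN bit is set in the effective set."""
    return bool((effective_caps(status_text) >> CAP_SYS_ADMIN) & 1)
```

Inside a container, calling `has_cap_sys_admin(open("/proc/self/status").read())` reports whether the current process holds the capability. A `True` result for an ML workload is a sign that its privileges should be reviewed.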
Patch Details
Patched drivers (e.g., Linux 550.90.07+, Windows 552.12+) were released by NVIDIA to correct the memory management logic and prevent the use-after-free condition.