NVIDIA CUDA Driver Use-After-Free Vulnerability Allows GPU Container Escape to Host System
Overview
A use-after-free vulnerability was identified in the NVIDIA kernel-mode driver's handling of Unified Memory (UVM) mappings. A malicious process running inside a GPU-enabled container, such as a Docker container or a Kubernetes pod, can trigger it by issuing a crafted sequence of CUDA API calls that races memory allocation against GPU channel submission. The race leaves the kernel driver holding a stale pointer to a freed memory object.

By reallocating that memory region and controlling its contents, the attacker can corrupt kernel data structures. Successful exploitation lets the attacker overwrite kernel function pointers and execute arbitrary code in the context of the host's kernel. This breaks the container isolation boundary and gives the attacker full root access to the underlying host node.

The risk is critical for multi-tenant AI cloud providers and on-premises Kubernetes clusters that share GPU resources: a single compromised tenant workload could take over the entire physical machine and access or disrupt every other containerized workload on that node.
Affected Systems
Testing Guide
1. **Check Driver Version:** On the host system, run `nvidia-smi` to check the installed driver version.
2. **Compare with Patched Versions:** Compare the reported version against the patched versions listed in the NVIDIA security bulletin. If your version is lower than the patched version for your platform, treat the system as vulnerable.
3. **Run Vendor-Provided Scanner:** Use any vulnerability scanning tools provided by NVIDIA or your OS vendor that have been updated with signatures for this CVE.
4. **Monitor Kernel Logs:** In a controlled test environment, run proof-of-concept exploit code (if available) and monitor `dmesg` or `/var/log/kern.log` on the host for signs of kernel panics, taints, or memory corruption errors originating from the NVIDIA driver.
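The log monitoring in step 4 can be partly automated. The following is a minimal sketch, not a vendor-provided detector: the function name `find_fault_lines` is hypothetical, and the marker list is an assumption based on common Linux kernel fault messages and the `NVRM: Xid` prefix the NVIDIA driver uses for GPU error reports. A match only means the line deserves a closer look, and a clean scan does not prove the host is safe.

```python
import re

# Assumed set of fault markers; extend or narrow it for your environment.
FAULT_PATTERN = re.compile(
    r"BUG:|Oops|general protection fault|KASAN|Kernel panic|NVRM: Xid"
)


def find_fault_lines(lines):
    """Return the log lines that contain one of the assumed fault markers."""
    return [line.rstrip("\n") for line in lines if FAULT_PATTERN.search(line)]
```

To scan a saved log, pass an open file, for example `find_fault_lines(open("/var/log/kern.log"))`, then review the surrounding lines of each hit for references to the NVIDIA kernel modules.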
Mitigation Steps
1. **Update NVIDIA Drivers:** Immediately update all host systems to the latest patched NVIDIA driver version as specified in the security bulletin.
2. **Use Secure Computing Mode (seccomp):** Apply strict seccomp profiles to GPU-enabled containers to limit the syscalls available to the container, potentially disrupting exploitation primitives.
3. **Isolate Untrusted Workloads:** When possible, run untrusted or experimental AI workloads on dedicated, physically isolated hardware clusters.
4. **Enable IOMMU/vSGA:** In virtualized environments, ensure that IOMMU (Input-Output Memory Management Unit) is enabled to provide an additional layer of hardware-enforced memory isolation between the VM and the host.
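As one possible way to approach the seccomp step, the sketch below tightens an existing allowlist-style profile in Docker's seccomp JSON format by removing selected syscalls from its `SCMP_ACT_ALLOW` rules. The helper names and the `EXTRA_DENY` list are illustrative assumptions, not recommendations from the advisory, and some of these syscalls may already be absent from your base profile. Because CUDA workloads reach the kernel driver through `ioctl`, which must stay allowed, this is defense in depth and not a substitute for patching.

```python
import copy
import json

# Illustrative deny list (an assumption; tune it to what your workloads need).
EXTRA_DENY = ["userfaultfd", "bpf", "perf_event_open", "keyctl", "add_key"]


def tighten_profile(profile, deny=EXTRA_DENY):
    """Return a copy of an allowlist profile with `deny` removed from ALLOW rules.

    This only restricts anything when the profile's defaultAction denies
    syscalls that no rule matches (for example SCMP_ACT_ERRNO).
    """
    tightened = copy.deepcopy(profile)
    denied = set(deny)
    kept_rules = []
    for rule in tightened.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW" and "names" in rule:
            rule["names"] = [name for name in rule["names"] if name not in denied]
            if not rule["names"]:
                continue  # drop rules that no longer allow anything
        kept_rules.append(rule)
    tightened["syscalls"] = kept_rules
    return tightened


def write_profile(profile, path):
    """Write the profile as JSON so a container runtime can load it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(profile, handle, indent=2)
```

A profile written this way can be passed to Docker with `--security-opt seccomp=<path>`. Test GPU workloads against the tightened profile before rolling it out, since removing a syscall a framework depends on will make the container fail at runtime.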
Patch Details
Patched in NVIDIA driver versions 550.54.14 (Linux) and 551.78 (Windows).
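Checking a host against these versions can be scripted. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the `PATCHED` table simply restates the two versions above, and the comparison is a plain numeric one that does not account for other driver branches that may have received their own fixed releases. It reads the version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import re
import subprocess

# Patched versions as stated in this advisory.
PATCHED = {"linux": (550, 54, 14), "windows": (551, 78)}


def parse_version(text):
    """Turn a dotted version string such as '550.54.14' into a tuple of ints."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)*)\s*", text)
    if not match:
        raise ValueError(f"unrecognized driver version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_older_than_patched(installed, platform):
    """True if `installed` is numerically lower than the patched version."""
    return parse_version(installed) < PATCHED[platform]


def installed_driver_version():
    """Ask nvidia-smi for the driver version (reported once per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a Linux host with the driver installed, `is_older_than_patched(installed_driver_version(), "linux")` returning `True` means the driver predates the patched release and should be checked against the security bulletin.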