NVIDIA CUDA Driver Use-After-Free Vulnerability Enabling GPU-based Container Escape
Overview
A critical use-after-free vulnerability, tracked as CVE-2024-0073, was identified in the kernel-mode layer of the NVIDIA CUDA driver. The flaw lies in how the driver allocates and frees memory for specific CUDA operations. An unprivileged user-mode process can trigger it by issuing a specially crafted sequence of API calls that causes the driver to dereference a pointer to memory it has already freed.

In AI and machine learning environments, this is a significant threat to multi-tenant GPU clusters and containerized ML workloads. An attacker who can execute code inside a container (for example, through a compromised Jupyter notebook or a malicious ML training script) can exploit the flaw to escalate privileges on the host operating system. Successful exploitation yields arbitrary code execution with kernel-level privileges, which amounts to a full container escape. From there, the attacker can compromise the underlying host, access data belonging to other containers on the same machine, and potentially pivot to other nodes in the cluster.

This class of vulnerability is particularly dangerous on cloud AI platforms and on-premises GPU farms where GPUs are shared among multiple users or teams, because it breaks the isolation boundaries that such shared infrastructure depends on.
Affected Systems
Host systems running an NVIDIA driver older than the patched versions listed under Patch Details. Exposure is highest on multi-tenant GPU clusters and on hosts that run containerized ML workloads for multiple users or teams.
Testing Guide
1. **Check Driver Version:** On the Linux host system, run the command `nvidia-smi` to view the installed NVIDIA driver version.
2. **Compare with Patched Versions:** Cross-reference the installed version with the 'Software Updates' section of the NVIDIA security bulletin for CVE-2024-0073.
3. **Vulnerability Scanning:** Use a host-level vulnerability scanner that has been updated with checks for this CVE to scan all GPU-enabled nodes in your infrastructure.
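The first two steps can be automated per node. The following is a minimal sketch, not an official NVIDIA tool. It assumes that the first component of a driver version identifies its branch, that the per-branch minimums are the ones quoted in this advisory, and that any branch newer than the newest listed one already contains the fix. The names `PATCHED_MINIMUMS`, `patch_status`, and `installed_driver_versions` are illustrative. Branches not listed here are reported as `unknown` and should be checked against the NVIDIA bulletin by hand.

```python
import subprocess

# Minimum patched version per driver branch, copied from this advisory.
# Assumption: the first version component identifies the branch.
PATCHED_MINIMUMS = {
    551: (551, 46),
    545: (545, 29, 6),
    535: (535, 161, 7),
}


def parse_version(text):
    """Turn a version string such as '535.161.07' into (535, 161, 7)."""
    return tuple(int(part) for part in text.strip().split("."))


def patch_status(version_text):
    """Return 'patched', 'vulnerable', or 'unknown' for one driver version."""
    version = parse_version(version_text)
    branch = version[0]
    if branch in PATCHED_MINIMUMS:
        return "patched" if version >= PATCHED_MINIMUMS[branch] else "vulnerable"
    if branch > max(PATCHED_MINIMUMS):
        # Assumption: branches newer than any listed one already carry the fix.
        return "patched"
    # Branch not covered by this advisory; consult the NVIDIA bulletin.
    return "unknown"


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    lines = (line.strip() for line in result.stdout.splitlines())
    return sorted({line for line in lines if line})


if __name__ == "__main__":
    try:
        versions = installed_driver_versions()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
    else:
        for version in versions:
            print(f"{version}: {patch_status(version)}")
```

Run on each GPU host, the script prints one line per distinct driver version. Keeping `patch_status` free of any `nvidia-smi` call means the same function can also be applied to versions collected from an inventory system.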
Mitigation Steps
1. **Patch NVIDIA Drivers:** Immediately update the GPU drivers on all host machines to the versions specified in the NVIDIA security bulletin (551.46, 545.29.06, 535.161.07, or newer).
2. **Restrict Driver Access:** Use container runtimes and orchestration platforms (like Kubernetes) with strict seccomp and AppArmor/SELinux profiles to limit the system calls that containers can make to the GPU driver.
3. **Use gVisor or Kata Containers:** For workloads requiring the highest level of isolation, run ML jobs in a sandboxed runtime such as gVisor or Kata Containers, which places an additional boundary (a user-space kernel or a lightweight VM, respectively) between the workload and the host kernel.
4. **Node-level Segregation:** Assign critical or untrusted workloads to dedicated GPU nodes to limit the blast radius of a potential container escape.
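Steps 2 through 4 can be combined in a single Kubernetes Pod specification. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. Several names are assumptions for illustration only: a RuntimeClass called `gvisor` must already exist in the cluster and support GPU workloads, the dedicated nodes are assumed to carry a hypothetical `gpu-pool` label and matching taint, and the image name is a placeholder. The `RuntimeDefault` seccomp profile is a baseline; a stricter custom profile may be needed to meaningfully narrow driver access.

```python
import json


def build_gpu_pod(name, image, runtime_class="gvisor", node_pool="untrusted-gpu"):
    """Build a Pod manifest for an untrusted GPU job as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Step 3: run under a sandboxed runtime (assumed RuntimeClass name).
            "runtimeClassName": runtime_class,
            # Step 4: pin the job to dedicated nodes (hypothetical label/taint).
            "nodeSelector": {"gpu-pool": node_pool},
            "tolerations": [
                {
                    "key": "gpu-pool",
                    "operator": "Equal",
                    "value": node_pool,
                    "effect": "NoSchedule",
                }
            ],
            # Step 2: apply the container runtime's default seccomp profile.
            "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            "containers": [
                {
                    "name": "job",
                    "image": image,
                    "securityContext": {
                        "allowPrivilegeEscalation": False,
                        "capabilities": {"drop": ["ALL"]},
                    },
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder image name; replace with a real training image.
    pod = build_gpu_pod("train-job", "registry.example.com/ml/train:latest")
    print(json.dumps(pod, indent=2))
```

None of these settings removes the underlying driver flaw. They only narrow who can reach it and what an escape can touch, so patching remains the primary fix.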
Patch Details
Patched in NVIDIA driver versions 551.46, 545.29.06, 535.161.07 and later.