NVIDIA CUDA Toolkit Driver Vulnerability Allows GPU Memory Hijacking in Multi-Tenant Environments
Overview
A critical use-after-free vulnerability was identified in the NVIDIA CUDA Toolkit's kernel-mode driver for Linux. The flaw resides in the driver's memory management unit (MMU) notifier, which improperly handles the unmapping of GPU memory pages when a process terminates. In a multi-tenant AI/ML environment, such as a Kubernetes cluster with GPU sharing or a JupyterHub deployment, a malicious tenant could exploit this vulnerability to gain read/write access to the GPU memory of other users' processes on the same physical GPU.
The attack sequence has three steps: the malicious actor allocates a large buffer on the GPU, initiates an asynchronous operation, and then deliberately causes their process to crash or exit abnormally. Because of the use-after-free condition, the driver fails to invalidate pointers to this memory region. A process that a different tenant subsequently schedules on the same GPU can then be allocated the same physical memory addresses. The attacker, having retained a handle to this memory, can read sensitive data (e.g., model weights, training data, inference results) or write malicious data to corrupt the victim's model computations.
This vulnerability breaks the security isolation between GPU-accelerated containers and poses a severe risk to data confidentiality and integrity in cloud and on-premises AI infrastructure.
Affected Systems
Linux systems running an NVIDIA kernel-mode driver older than version 550.75, together with CUDA Toolkit releases older than 12.5. Exposure is highest where a single physical GPU is shared between mutually untrusted tenants.
Testing Guide
1. Check the installed NVIDIA driver version by running `nvidia-smi` in the terminal. The version appears in the header row of the output, labeled `Driver Version`.
2. If the driver version is older than 550.75, the system is vulnerable.
3. To check the CUDA Toolkit version, run `nvcc --version`. The `CUDA Version` field shown by `nvidia-smi` is the highest CUDA version the driver supports, not the installed toolkit version.
4. Exploitation requires precise control over memory layout and timing, so a simple proof-of-concept is difficult to build. The primary verification method is checking the driver version number.
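The version check in the steps above can be automated across a fleet. The following is a minimal sketch, not an official NVIDIA tool: it assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu=driver_version` option. The function names and the `PATCHED_DRIVER` constant are hypothetical, and the 550.75 threshold is taken directly from this advisory.

```python
import re
import subprocess

# First fixed driver version according to this advisory (assumption: a plain
# numeric comparison against this value is sufficient for all driver branches).
PATCHED_DRIVER = (550, 75)


def parse_version(text):
    """Turn a dotted version string such as '550.54.14' into a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_vulnerable(driver_version, patched=PATCHED_DRIVER):
    """Return True when the driver is older than the patched release."""
    return parse_version(driver_version) < patched


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for version in installed_driver_versions():
        status = "VULNERABLE" if is_vulnerable(version) else "patched"
        print(f"driver {version}: {status}")
```

Comparing versions as integer tuples avoids the pitfalls of string comparison, where `"550.8"` would sort after `"550.75"`. Run the script on each GPU node, or call `is_vulnerable` on version strings collected by existing inventory tooling.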
Mitigation Steps
1. **Update NVIDIA Drivers:** Immediately update the NVIDIA kernel-mode driver on all affected systems to version 550.75 or later.
2. **Update CUDA Toolkit:** Ensure the installed CUDA Toolkit is updated to version 12.5 or newer to get the corresponding patched libraries.
3. **Isolate Workloads:** Where possible, avoid sharing a single physical GPU between mutually untrusted tenants. Use dedicated GPUs per tenant or leverage stronger isolation technologies like NVIDIA MIG (Multi-Instance GPU).
4. **Monitor System Logs:** Monitor kernel logs (`dmesg`) for messages related to GPU page faults or driver errors, which could indicate attempted exploitation.
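The log monitoring step can be sketched as a small filter over kernel log text. This is an illustrative example only. It assumes the NVIDIA driver reports errors as `NVRM: Xid (...)` lines, which is the commonly documented format. The advisory does not say which messages, if any, exploitation of this flaw produces, so treat matches as a prompt for investigation rather than proof of an attack. The function name and the sample log line are hypothetical.

```python
import re
import subprocess

# Assumed kernel log format for NVIDIA driver errors:
#   NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, ...
# Xid 31 is commonly documented as a GPU memory page fault.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+),")


def find_xid_events(log_text):
    """Return (device, xid_code, raw_line) for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (match.group("device"), int(match.group("code")), line.strip())
            )
    return events


if __name__ == "__main__":
    # Reading the kernel ring buffer usually requires root privileges.
    kernel_log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    for device, code, raw in find_xid_events(kernel_log):
        print(f"{device}: Xid {code}: {raw}")
```

In practice the same pattern could be applied to logs already shipped to a central collector, so that alerts can be correlated with tenant process exits on the affected GPU.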
Patch Details
The vulnerability is patched in NVIDIA Linux driver version 550.75 and the corresponding CUDA Toolkit 12.5 release. The patch corrects the memory handling logic in the MMU notifier.