NVIDIA CUDA Driver Kernel-Mode Flaw Allows Container Escape and Host Denial of Service
Overview
A high-severity vulnerability was identified in the NVIDIA CUDA driver for both Linux and Windows. The flaw resides in the kernel-mode driver component that processes memory-mapping and command-submission requests from user-space applications, including those running inside containers. A specially crafted sequence of CUDA API calls can trigger a race condition or an integer overflow when allocating shared memory regions on the GPU.

An attacker with access to a container that has GPU passthrough (a common configuration for ML training and inference workloads in Kubernetes) can exploit this vulnerability. Successful exploitation has two primary impacts. First, it can cause a fatal kernel error, resulting in a complete denial of service (DoS) of the host machine and crashing all running containers and services. Second, and more critically, the memory corruption could be leveraged to read or write arbitrary host kernel memory, potentially allowing the attacker to escalate privileges, escape the container's isolation, and execute code on the underlying host node.

The vulnerability affects a wide range of NVIDIA drivers and poses a significant risk to multi-tenant cloud environments and on-premises GPU clusters where untrusted or semi-trusted code is executed.
Affected Systems
Testing Guide
1. **Check Driver Version:** On Linux, run `nvidia-smi` to display the installed driver version. On Windows, check the NVIDIA Control Panel under 'System Information'.
2. **Compare with Bulletin:** Cross-reference the installed version with the 'Affected Products' section of the relevant NVIDIA Security Bulletin.
3. **Run NVIDIA Security Scanners:** Utilize any security scanning tools provided by NVIDIA or third-party vulnerability scanners that have checks for this specific CVE.
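The first two checks can be scripted across a fleet. The following is a minimal sketch, not an official tool: it assumes the fixed versions listed under Patch Details (550.54.14 for Linux, 551.61 for Windows) are the only thresholds that matter, and the helper names (`parse_version`, `is_patched`, `installed_driver_versions`) are hypothetical. Other driver branches may have their own fixed versions, so the NVIDIA Security Bulletin remains the authoritative reference.

```python
import platform
import subprocess

# Minimum fixed versions as stated in the Patch Details section of this advisory.
# Assumption: a simple "at or above this version" rule is sufficient.
PATCHED = {"linux": (550, 54, 14), "windows": (551, 61)}


def parse_version(text):
    """Turn a dotted version string such as '550.54.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(installed, os_name):
    """Return True if `installed` is at or above the fixed version for `os_name`."""
    return parse_version(installed) >= PATCHED[os_name]


def installed_driver_versions():
    """Ask nvidia-smi for the driver version of each GPU; empty list if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    os_name = platform.system().lower()
    if os_name in PATCHED:
        for version in installed_driver_versions():
            status = "patched" if is_patched(version, os_name) else "POTENTIALLY VULNERABLE"
            print(f"driver {version}: {status}")
```

Version strings are compared as integer tuples rather than as text, so that `551.100` is correctly treated as newer than `551.61`.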
Mitigation Steps
1. **Update NVIDIA Drivers:** Immediately apply the security patches provided by NVIDIA by updating to the latest available driver version for your specific GPU and operating system.
2. **Restrict GPU Access:** In multi-tenant environments, avoid providing direct GPU passthrough to untrusted containers. Use virtualization technologies like NVIDIA vGPU with stronger isolation guarantees where possible.
3. **Use Secure Base Images:** Ensure that ML container images are built from minimal, hardened base images and that all system packages, including the CUDA toolkit within the container, are regularly updated.
4. **Monitor Host System:** Implement host-level intrusion detection and monitoring to detect anomalous kernel activity or crashes that could indicate an exploitation attempt.
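As one illustration of the host monitoring step, kernel log lines can be filtered for GPU driver errors and kernel faults. This is a rough sketch under stated assumptions: the pattern list is a set of generic indicators (NVIDIA `NVRM: Xid` error reports and common Linux kernel fault messages), not a signature for this specific vulnerability, and the function name `flag_kernel_lines` is hypothetical. A match warrants investigation but does not by itself indicate exploitation.

```python
import re

# Generic kernel-log indicators, chosen for illustration only.
# Assumption: none of these is specific to this vulnerability.
SUSPICIOUS = [
    re.compile(r"NVRM: Xid"),                 # NVIDIA driver error report
    re.compile(r"general protection fault"),  # invalid memory access in the kernel
    re.compile(r"kernel BUG at"),             # kernel assertion failure
    re.compile(r"\bOops\b"),                  # kernel oops
]


def flag_kernel_lines(lines):
    """Return the log lines that match any of the suspicious patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in SUSPICIOUS)
    ]


# Example with hand-written sample lines (not real log output).
sample = [
    "usb 1-1: new high-speed USB device number 2 using xhci_hcd\n",
    "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.\n",
    "general protection fault, probably for non-canonical address\n",
]
for hit in flag_kernel_lines(sample):
    print("ALERT:", hit)
```

On a Linux host, the same function could be fed the output of `dmesg` or `journalctl -k`, with matches forwarded to whatever alerting system is already in place.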
Patch Details
Patched in driver versions 550.54.14 (Linux) and 551.61 (Windows) and later, as detailed in the April 2025 NVIDIA Security Bulletin.