NVIDIA CUDA Driver Privilege Escalation and Container Escape in Multi-Tenant GPU Clusters
Overview
A critical use-after-free vulnerability was discovered in the NVIDIA kernel-mode driver for Linux (`nvidia.ko`). The flaw resided in the driver's unified memory management: insufficient locking and state validation allowed a user-space process to deallocate a memory object while the kernel was still using it. It could be triggered by passing malformed memory-mapping arguments through the CUDA API.

An unprivileged user in a container with GPU access could exploit the flaw to achieve arbitrary kernel code execution, which completely breaks the isolation boundary between the container and the host operating system. In multi-tenant AI/ML cloud environments, such as Kubernetes clusters with GPU sharing, a malicious tenant could escape their container, gain root privileges on the host node, and then access or disrupt the workloads and data of every other tenant on the same physical machine.

The discovery underscored how security-critical GPU drivers are in virtualized infrastructure and how much impact hardware-adjacent vulnerabilities can have.
Affected Systems
Testing Guide
1. Check the currently installed NVIDIA driver version on your host systems using the `nvidia-smi` command.
2. Compare the installed version against the patched versions listed in the NVIDIA security bulletin (e.g., 550.78 for the R550 branch).
3. If your version is lower than the patched version for your release branch, the system is vulnerable.
4. Do not run a proof-of-concept exploit on production systems. Rely on version checking for verification.
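The version comparison in steps 1 to 3 can be sketched in Python. This is a minimal illustration, not an official tool: the `PATCHED` table contains only the two versions quoted in this advisory, the function names are hypothetical, and any branch not in the table is reported as unknown rather than guessed. The `nvidia-smi` query flags used are the standard `--query-gpu=driver_version --format=csv,noheader` options.

```python
import subprocess

# Patched versions quoted in this advisory, keyed by release branch (major version).
# Assumption: other branches are not covered here; consult the NVIDIA security bulletin.
PATCHED = {550: "550.78", 535: "535.183.01"}


def parse_version(text):
    """Turn '535.183.01' into (535, 183, 1), padding to at least three components."""
    parts = [int(p) for p in text.strip().split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def is_vulnerable(installed, patched=PATCHED):
    """Return True or False for a listed branch, or None if the branch is not listed."""
    version = parse_version(installed)
    fixed = patched.get(version[0])
    if fixed is None:
        return None  # unknown branch: do not guess
    return version < parse_version(fixed)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version (it prints one line per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted({line.strip() for line in out.splitlines() if line.strip()})
```

On a GPU host, calling `is_vulnerable(v)` for each `v` in `installed_driver_versions()` gives a per-node answer. Versions are compared component by component as integers, so `550.78` is correctly treated as newer than `550.54.15`, which a plain string comparison would not guarantee in general.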
Mitigation Steps
1. **Apply Driver Updates**: Immediately update all affected NVIDIA drivers on host machines to the patched versions specified in the NVIDIA security bulletin.
2. **Use gVisor or Kata Containers**: For workloads requiring strong isolation, use a sandboxed container runtime such as gVisor, which intercepts system calls in user space, or Kata Containers, which runs each workload in a lightweight virtual machine. Both reduce the host kernel's exposure to kernel-level exploits.
3. **Restrict GPU Access**: Grant GPU access only to trusted workloads and tenants. Avoid sharing physical GPUs between mutually untrusting parties where possible.
4. **Monitor Kernel Logs**: Monitor for anomalous kernel-level activity and GPU driver errors, which could indicate an exploitation attempt.
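As one possible starting point for the kernel log monitoring step, the sketch below scans log lines for NVIDIA driver Xid error reports (logged with an `NVRM: Xid` prefix) and a few generic kernel fault markers. The marker list and the function name are illustrative assumptions, not a detection signature for this vulnerability, and a match indicates a driver or kernel fault worth investigating rather than proof of exploitation.

```python
import re

# NVIDIA driver Xid reports look like: "NVRM: Xid (PCI:0000:3b:00): 31, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")

# Generic kernel fault markers; an illustrative, non-exhaustive assumption.
FAULT_MARKERS = ("BUG: KASAN", "general protection fault", "Oops")


def scan_kernel_log(lines):
    """Return (kind, detail, line) tuples for lines that deserve a closer look."""
    findings = []
    for raw in lines:
        line = raw.rstrip()
        match = XID_RE.search(line)
        if match:
            findings.append(("xid", match.group("code"), line))
            continue
        for marker in FAULT_MARKERS:
            if marker in line:
                findings.append(("fault", marker, line))
                break
    return findings
```

The function accepts any iterable of lines, so it can be fed from a saved log file or from the output of `dmesg` or `journalctl -k`. In practice the findings would be forwarded to whatever alerting system the cluster already uses.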
Patch Details
NVIDIA released patched driver versions 550.78 and 535.183.01 to address this vulnerability.