NVIDIA GPU Driver Kernel Mode Layer Allows for Privilege Escalation in Multi-Tenant AI Clusters
Overview
A high-severity vulnerability, identified as CVE-2024-0073, was found in the kernel mode driver for a wide range of NVIDIA GPU products. The flaw resides in the way the driver handles memory mapping for CUDA operations: improper input validation can lead to an out-of-bounds write in kernel memory. A local, low-privileged user can trigger the vulnerability by submitting a specially crafted CUDA compute job to the GPU. In a multi-tenant environment, such as a shared JupyterHub instance, a university research cluster, or a cloud provider's GPU-as-a-service offering, any user with access to the GPU can attempt exploitation.

A successful exploit could cause a Denial of Service (DoS), crashing the entire host machine and disrupting all workloads. More critically, it could be leveraged for arbitrary code execution with kernel-level privileges, allowing a complete takeover of the host system. An attacker could then access data from other users' workloads, escape container boundaries, and potentially move laterally across the infrastructure.

The vulnerability was discovered by an independent security researcher through fuzzing of the CUDA API.
Affected Systems
Testing Guide
1. Check your currently installed NVIDIA driver version. On Linux, run `nvidia-smi`. On Windows, check the NVIDIA Control Panel.
2. Compare your installed version against the patched versions listed in the NVIDIA security bulletin for CVE-2024-0073.
3. If your driver version is lower than the patched version, your system is vulnerable and should be updated immediately.
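On hosts with many GPUs or nodes, the version check can be scripted. The following is a minimal sketch, not an official tool: it reads the driver version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares it numerically against a patched version that you supply from the security bulletin. The function names (`parse_version`, `is_vulnerable`, `installed_driver_versions`, `report`) are illustrative, and the comparison assumes you pass the patched version for the same driver branch as the installed one, since patched versions differ per branch.

```python
import subprocess


def parse_version(text):
    """Turn a dotted version such as '535.154.05' into (535, 154, 5)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(installed, patched):
    """Return True when the installed driver is older than the patched one.

    Both arguments are dotted version strings. The patched value must come
    from the NVIDIA security bulletin; none is hard-coded here.
    """
    return parse_version(installed) < parse_version(patched)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def report(patched):
    """Print one status line per GPU on this host."""
    for version in installed_driver_versions():
        status = "VULNERABLE" if is_vulnerable(version, patched) else "ok"
        print(f"driver {version}: {status}")
```

Calling `report(...)` with the patched version for your branch prints one line per GPU. Comparing versions as integer tuples avoids the mistake of string comparison, where `"535.9"` would sort after `"535.104"`.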
Mitigation Steps
1. **Update Drivers**: Immediately update all NVIDIA drivers on affected systems to the versions specified in the NVIDIA security bulletin.
2. **Workload Isolation**: Utilize strong isolation mechanisms for GPU workloads. For Kubernetes, use tools like gVisor or Kata Containers for pods requiring GPU access.
3. **Restrict GPU Access**: Limit direct GPU access to trusted users and processes. Implement strict scheduling and resource quotas.
4. **Monitor System Logs**: Monitor kernel logs for signs of instability or anomalous errors related to the NVIDIA driver, which could indicate attempted exploitation.
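As one possible starting point for the log monitoring step, the sketch below scans kernel log lines for NVIDIA driver error reports. It assumes the Linux driver's usual convention of prefixing kernel messages with `NVRM:` and reporting GPU errors as `Xid` events; the function name `find_xid_events` is illustrative. Xid events also have benign causes such as hardware faults or buggy applications, so a match is a signal for triage, not proof of exploitation.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 31, pid=4242, ...
# Group 1 is the device identifier, group 2 is the numeric Xid code.
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_events(lines):
    """Yield (device, xid_code, raw_line) for each NVRM Xid message found."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield match.group(1), int(match.group(2)), line.rstrip("\n")
```

The function accepts any iterable of lines, so it can be fed the output of `dmesg` or `journalctl -k`, or an open log file, and the resulting events can be forwarded to whatever alerting system is already in place.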
Patch Details
Patches are available in NVIDIA driver branches 550 and later. See NVIDIA Security Bulletin 5551 for specific version details.