NVIDIA GPU Driver Vulnerability Allows Denial-of-Service in Multi-Tenant AI Environments
Overview
A high-severity vulnerability (CVE-2024-0073) was identified in the NVIDIA GPU driver's kernel mode layer, posing a significant threat to multi-tenant AI and ML cloud infrastructure. The vulnerability stems from improper input validation when processing data passed from user-mode applications to the kernel driver.

An attacker with local user access, such as a customer in a shared GPU compute environment, could craft a malicious CUDA application or a specially formed model that sends malformed data to the GPU driver. When the kernel driver attempts to process this input, it can lead to a race condition or a null pointer dereference, causing a kernel panic and a complete crash of the host operating system. This results in a Denial of Service (DoS) that affects all tenants sharing the physical GPU. In some scenarios, such a flaw could also be exploited for local privilege escalation (LPE), allowing the attacker to escape their containerized environment and gain administrative control over the host node.

This vulnerability is particularly dangerous for cloud service providers offering GPU-accelerated virtual machines or Kubernetes-based ML platforms, as it breaks the isolation model between tenants and allows one malicious user to impact service availability for all others on the same hardware.
Affected Systems
Testing Guide
1. Log into the Linux host system that has the NVIDIA GPU.
2. Run the command `nvidia-smi`.
3. In the header row at the top of the output, read the `Driver Version` field.
4. Compare your installed version to the patched versions. For example, if your version is `535.129.03`, you are vulnerable and must update to `535.161.07` or newer.
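The manual check above can be scripted across a fleet of hosts. The following is a minimal sketch, not an official NVIDIA tool: it reads the driver version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares it numerically against the Linux patched version given in this advisory. The function names are illustrative, and the plain numeric comparison is an assumption; if you run a different driver branch, take the patched version for that branch from the NVIDIA security bulletin.

```python
import subprocess

# Patched Linux driver version taken from the Patch Details section.
PATCHED_LINUX = "535.161.07"


def parse_version(text):
    """Turn '535.129.03' into (535, 129, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(installed, patched=PATCHED_LINUX):
    """Return True if the installed version is older than the patched one.

    Assumption: a simple numeric comparison is sufficient. This does not
    account for other driver branches having their own patched versions.
    """
    return parse_version(installed) < parse_version(patched)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version (one output line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


if __name__ == "__main__":
    try:
        for version in installed_driver_versions():
            status = "VULNERABLE, update required" if is_vulnerable(version) else "ok"
            print(f"{version}: {status}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # nvidia-smi is missing or failed; report the error without a traceback.
        print(f"could not query driver version: {exc}")
```

With the example from step 4, `is_vulnerable("535.129.03")` evaluates to true, while `is_vulnerable("535.161.07")` evaluates to false.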
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update all host systems to the patched driver versions specified in the NVIDIA security bulletin.
2. **Implement Workload Scrutiny**: In multi-tenant environments, consider scanning or validating custom CUDA kernels or serialized models submitted by users for known malicious patterns.
3. **Enhance Isolation**: Use stronger isolation technologies like Kata Containers or gVisor for running untrusted workloads on shared GPUs, which can intercept and validate syscalls to the kernel driver.
4. **Monitor System Logs**: Continuously monitor kernel logs and system stability for signs of driver crashes or unusual GPU behavior that could indicate an attack attempt.
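As a starting point for the log monitoring in step 4, the sketch below filters kernel log lines for a few crash indicators. The pattern list is an assumption chosen for illustration, not a signature set for this CVE: `NVRM: Xid` is the prefix the NVIDIA driver uses when reporting GPU errors to the kernel log, and the other two patterns match generic Linux messages for null pointer dereferences and kernel panics. The function name `suspicious_lines` is hypothetical.

```python
import re

# Assumed indicators of driver or kernel trouble; tune for your environment.
PATTERNS = [
    re.compile(r"NVRM: Xid"),                                       # NVIDIA driver error report
    re.compile(r"kernel NULL pointer dereference", re.IGNORECASE),  # kernel oops message
    re.compile(r"Kernel panic", re.IGNORECASE),                     # host-level crash
]


def suspicious_lines(lines):
    """Return the log lines that match any of the indicator patterns.

    `lines` can be any iterable of strings, for example an open file
    object or the split output of a log command.
    """
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in PATTERNS)
    ]
```

The input could come from `dmesg` or `journalctl -k` output captured on each host. A match is only a hint that something crashed; it does not by itself show that an attack took place.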
Patch Details
Patches are available in NVIDIA driver versions 535.161.07 (Linux) and 551.61 (Windows) and later.