NVIDIA GPU Driver Kernel Mode Layer Allows Privilege Escalation in Multi-Tenant AI Clusters
Overview
A high-severity vulnerability was discovered in the NVIDIA GPU driver for Linux. The flaw resides in the memory mapping functionality of the driver's kernel-mode layer and allows a local user with access to a GPU device to trigger a use-after-free or out-of-bounds write condition. In a multi-tenant cloud environment or an on-premises Kubernetes cluster where GPU resources are shared among users or containers, the flaw can be exploited for privilege escalation: a malicious user running a specially crafted CUDA application inside a container could escape the container's security boundary and gain root privileges on the host node. From there, the attacker could access data from other tenants' ML workloads, inject malicious code into their models, disrupt training jobs, or compromise the entire cluster.

The vulnerability affects a wide range of data center GPUs and highlights the risks of insufficient isolation in shared AI infrastructure. It was responsibly disclosed by cloud security researchers, who demonstrated a proof-of-concept exploit in a managed Kubernetes environment that escalated from a standard Jupyter notebook user to full node administrator.
Affected Systems
Testing Guide
1. **Check Driver Version:** On a Linux host with an NVIDIA GPU, run `nvidia-smi` to check the installed driver version. Compare this with the patched version number in the NVIDIA security bulletin.
2. **Query Cloud Provider:** Check your cloud provider's documentation or support channels for information on which machine images contain the patched drivers.
3. **Vulnerability Scanning:** Use a host-level vulnerability scanner that has checks for this specific CVE to scan your GPU nodes.
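The driver version check in step 1 can be scripted across many nodes. The following is a minimal sketch, not a vendor tool: it assumes `nvidia-smi` is on the `PATH`, uses the 550.76 threshold stated in this advisory, and compares versions numerically. The helper names (`parse_version`, `is_patched`, `installed_driver_versions`) are illustrative. Other driver branches may have their own fixed versions, so the NVIDIA security bulletin remains the authoritative reference.

```python
import subprocess

# Threshold taken from this advisory; other driver branches may differ.
MIN_PATCHED = "550.76"


def parse_version(text):
    """Turn a version string such as '550.54.14' into (550, 54, 14)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(version, minimum=MIN_PATCHED):
    """Return True if `version` is numerically at or above `minimum`."""
    return parse_version(version) >= parse_version(minimum)


def installed_driver_versions():
    """Return the distinct driver versions reported by nvidia-smi.

    Returns an empty list if nvidia-smi is missing or exits with an error.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


if __name__ == "__main__":
    versions = installed_driver_versions()
    if not versions:
        print("nvidia-smi not available or no GPU detected")
    for version in versions:
        status = "patched" if is_patched(version) else "VULNERABLE"
        print(f"{version}: {status}")
```

Running the script on each GPU node (for example from a configuration management job) gives a quick inventory of hosts that still need the update.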
Mitigation Steps
1. **Update Drivers:** Patch host systems by updating NVIDIA drivers to the version recommended by the vendor (550.76 or newer).
2. **Update Cloud Images:** If using cloud VMs, ensure you are using the latest GPU-enabled machine images provided by your cloud provider (AWS, GCP, Azure), which should include the patched drivers.
3. **Use gVisor or Kata Containers:** For defense-in-depth, run untrusted GPU workloads in stronger isolation environments such as gVisor or Kata Containers, which provide an additional kernel boundary.
4. **Restrict GPU Access:** Limit direct GPU access to trusted users and workloads. Use Kubernetes admission controllers to enforce policies on which pods can request GPU resources.
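The policy described in steps 3 and 4 can be sketched as the decision logic an admission webhook might apply to each pod. This is an illustration under stated assumptions, not a complete webhook: `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin, while the trusted namespace list, the sandboxed runtime class names, and the function names `requests_gpu` and `review_pod` are hypothetical and would depend on how a given cluster is configured.

```python
# Resource name advertised by the NVIDIA device plugin for Kubernetes.
GPU_RESOURCE = "nvidia.com/gpu"


def requests_gpu(pod):
    """Return True if any container or init container asks for a GPU."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("limits", "requests"):
            quantity = (resources.get(section) or {}).get(GPU_RESOURCE)
            if quantity is not None and str(quantity) != "0":
                return True
    return False


def review_pod(pod, trusted_namespaces, sandboxed_runtimes):
    """Decide whether a pod may be admitted; returns (allowed, reason).

    GPU pods are allowed only from trusted namespaces, or when they run under
    a sandboxed RuntimeClass. Both sets are cluster-specific assumptions.
    """
    if not requests_gpu(pod):
        return True, "no GPU requested"
    namespace = (pod.get("metadata") or {}).get("namespace", "default")
    if namespace in trusted_namespaces:
        return True, f"namespace {namespace!r} is trusted for GPU access"
    runtime = (pod.get("spec") or {}).get("runtimeClassName")
    if runtime in sandboxed_runtimes:
        return True, f"GPU pod runs under sandboxed runtime {runtime!r}"
    return False, f"GPU request from untrusted namespace {namespace!r} without a sandboxed runtime"


if __name__ == "__main__":
    example = {
        "metadata": {"namespace": "notebooks"},
        "spec": {
            "containers": [
                {"name": "jupyter", "resources": {"limits": {GPU_RESOURCE: 1}}}
            ]
        },
    }
    # Hypothetical policy values for illustration only.
    print(review_pod(example, {"ml-trusted"}, {"gvisor", "kata"}))
```

In a real deployment the same rule could be expressed in whichever admission mechanism the cluster already uses; the point is that GPU requests are denied by default and allowed only for explicitly trusted or sandboxed workloads.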
Patch Details
Update to NVIDIA driver version 550.76 or newer. Cloud providers like AWS, GCP, and Azure have also updated their base machine images with the patched drivers.