NVIDIA GPU Driver Kernel Mode Layer Flaw Allows for Privilege Escalation in Multi-Tenant AI Clusters
Overview
A high-severity vulnerability has been discovered in the kernel mode layer of NVIDIA's widely used GPU drivers for Linux. The flaw is an out-of-bounds write that a user-space process can trigger by submitting a specially crafted CUDA kernel. In multi-tenant environments, such as Kubernetes clusters with GPU sharing (e.g., Multi-Instance GPU, or MIG) or cloud-based ML training platforms, an unprivileged user or a containerized workload could exploit it.

At a minimum, a successful exploit could cause a Denial of Service (DoS) by crashing the host kernel, disrupting every workload on the physical machine. More critically, an attacker who can control the out-of-bounds write could overwrite specific kernel memory structures and achieve arbitrary code execution in the kernel, resulting in full privilege escalation on the host node. This breaks the security boundary between ML jobs and tenants: an attacker could escape their container, gain root access to the underlying host, and potentially access or compromise other tenants' models and sensitive data. The vulnerability underscores how critical the hardware and driver stack is to securing AI infrastructure.
Affected Systems
Testing Guide
1. Check the currently installed NVIDIA driver version on your Linux hosts by running `nvidia-smi`.
2. The driver version appears in the header of the output, in the `Driver Version` field.
3. If the displayed version is lower than `555.42.02`, the system is vulnerable and should be updated.
4. Public proof-of-concept exploit code may be unavailable or unsafe to run, so version checking is the most reliable way to confirm vulnerability status.
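For fleets with many GPU nodes, the version check can be scripted. The following is a minimal sketch, not an official tool: it assumes `nvidia-smi` is on the `PATH` and that driver versions are plain dotted numbers. The helper names (`parse_version`, `is_vulnerable`, `installed_driver_versions`) are illustrative and not part of any NVIDIA tooling.

```python
import subprocess

# 555.42.02 is the patched release named in this advisory.
PATCHED_VERSION = (555, 42, 2)


def parse_version(text):
    """Turn a dotted version string such as '550.54.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(driver_version, patched=PATCHED_VERSION):
    """Return True when the driver version sorts below the patched version."""
    return parse_version(driver_version) < patched


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        versions = sorted(set(installed_driver_versions()))
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # No driver or no GPU visible on this host; nothing to compare.
        print(f"could not query nvidia-smi: {exc}")
    else:
        for version in versions:
            status = "VULNERABLE" if is_vulnerable(version) else "ok"
            print(f"{version}: {status}")
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating `555.42.02` and `555.42.2` as different versions, or ordering `555.9` after `555.42`.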
Mitigation Steps
1. **Patch Drivers:** Update all NVIDIA drivers on host machines to the patched version (555.42.02 or newer) as soon as possible.
2. **Use VM-based Isolation:** For workloads requiring the highest level of security, use full virtual machines with GPU passthrough for isolation instead of containerization on a shared kernel.
3. **Restrict GPU Access:** Limit GPU access to only trusted workloads and users.
4. **Implement Egress Filtering:** Apply strict network egress filtering to GPU-enabled nodes to prevent compromised nodes from communicating with attacker-controlled servers.
5. **Monitor Kernel Logs:** Actively monitor for unexpected kernel panics or driver errors on GPU nodes, which could indicate exploitation attempts.
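The kernel log monitoring step can be approximated with a simple filter over kernel log lines. This is a rough sketch under stated assumptions: the pattern list is illustrative rather than an authoritative signature set for this vulnerability, and the function name `find_suspicious_lines` is hypothetical. `NVRM: Xid` is the prefix the NVIDIA driver uses for its error reports in the kernel log; the other patterns are generic kernel crash markers.

```python
import re

# Illustrative patterns only (an assumption, not an official signature list).
SUSPICIOUS_PATTERNS = [
    re.compile(r"NVRM: Xid"),                 # NVIDIA driver error reports
    re.compile(r"Kernel panic"),              # fatal kernel crash
    re.compile(r"\bOops\b"),                  # non-fatal kernel fault
    re.compile(r"general protection fault"),  # invalid memory access in kernel
]


def find_suspicious_lines(lines):
    """Return the log lines that match any of the suspicious patterns."""
    matches = []
    for line in lines:
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            matches.append(line.rstrip("\n"))
    return matches
```

One way to use it is to pass in any iterable of log lines, for example an open file containing the output of `journalctl -k` or `dmesg`. A match is only a prompt for investigation: driver errors and kernel faults also occur for benign reasons such as failing hardware.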
Patch Details
NVIDIA released patched driver version 555.42.02, which corrects the out-of-bounds write condition.