NVIDIA GPU Driver Kernel Vulnerability Allows Container Escape
Overview
A high-severity vulnerability, identified as CVE-2024-0074, was discovered in the kernel mode layer of the NVIDIA GPU Display Driver for Linux. It stems from improper input validation in the driver's kernel module when processing user-mode inputs, and it poses a significant threat to containerized AI/ML infrastructure. An attacker with low-privilege access inside a container that has GPU access (a common configuration for ML workloads, for example Docker with `--gpus all`) can exploit the flaw. By crafting a malicious CUDA application or model that sends specially formed data to the GPU, the attacker can trigger a memory-safety failure in the kernel, such as a buffer overflow or a race condition.
Successful exploitation gives the attacker arbitrary code execution with the full privileges of the host kernel. This breaks the container's isolation boundary and lets the attacker 'escape' to the host operating system. From there, the attacker can compromise the entire node, read data belonging to other containers, steal sensitive models or datasets, and use the machine as a launchpad for further attacks within the network. The risk is greatest in multi-tenant GPU clusters, where different users or processes are meant to be securely isolated from one another.
Affected Systems
Linux hosts running the NVIDIA GPU Display Driver at a version lower than the fixed release for their driver branch (see Patch Details), in particular hosts that expose GPUs to containers.
Testing Guide
1. **Check Driver Version**: On the host machine, run the command `nvidia-smi` to display the installed NVIDIA driver version.
2. **Compare with Patched Versions**: Compare the installed version with the patched versions listed in the NVIDIA security advisory. If your version is lower than the fixed versions in your release branch, you are vulnerable.
3. **Review Container Configurations**: Audit your container orchestration system (e.g., Kubernetes, Docker Swarm) to identify all pods or containers that are configured with GPU access.
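The container audit in step 3 can be partly automated on Kubernetes. The sketch below is one possible approach, not an official tool: it assumes GPUs are scheduled through the NVIDIA device plugin, which advertises the `nvidia.com/gpu` resource, and that the pod list has been exported with `kubectl get pods --all-namespaces -o json`. The function name `gpu_containers` is made up for this example. Containers that obtain GPU access another way (for instance through the `NVIDIA_VISIBLE_DEVICES` environment variable or privileged mode) are not detected and need a separate check.

```python
import json

# Resource name advertised by the NVIDIA device plugin (assumed to be in use).
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_containers(pod_list):
    """Yield (namespace, pod, container) for each container requesting a GPU.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    """
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for container in containers:
            resources = container.get("resources") or {}
            limits = resources.get("limits") or {}
            requests = resources.get("requests") or {}
            if GPU_RESOURCE in limits or GPU_RESOURCE in requests:
                yield (
                    meta.get("namespace", "default"),
                    meta.get("name", "?"),
                    container.get("name", "?"),
                )


# Small inline sample standing in for real `kubectl` output.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "ml", "name": "train-0"},
   "spec": {"containers": [
     {"name": "worker", "resources": {"limits": {"nvidia.com/gpu": "1"}}}]}},
  {"metadata": {"namespace": "web", "name": "api-0"},
   "spec": {"containers": [{"name": "app", "resources": {}}]}}
]}
""")

for namespace, pod, container in gpu_containers(sample):
    print(f"{namespace}/{pod}: container '{container}' has GPU access")
```

To run it against a live cluster, replace the inline sample with `json.load(sys.stdin)` and pipe the `kubectl` output into the script.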
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update the NVIDIA drivers on all host machines to a patched version as specified in the NVIDIA security bulletin.
2. **Limit GPU Access**: Only grant GPU access to trusted containers and workloads. Avoid running untrusted code in containers with GPU pass-through.
3. **Use Hardened Kernels**: Employ kernel hardening features and Linux Security Modules (LSMs) like AppArmor or SELinux to further restrict the capabilities of containerized processes.
4. **Implement Egress Filtering**: Block network access from GPU-enabled containers unless absolutely necessary to limit an attacker's ability to download exploits or exfiltrate data post-compromise.
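On Kubernetes, the egress filtering in step 4 is commonly expressed as a NetworkPolicy. The following sketch builds a deny-all-egress policy as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The namespace `ml-training`, the label `gpu-workload: "true"`, and the policy name are hypothetical placeholders. Two assumptions apply: the cluster's network plugin must actually enforce NetworkPolicy, and a policy with no egress rules also blocks DNS, so workloads that need name resolution require an explicit allow rule.

```python
import json


def deny_egress_policy(namespace, pod_labels, name="deny-gpu-egress"):
    """Build a NetworkPolicy that blocks all egress from the selected pods.

    Listing "Egress" in policyTypes while providing no egress rules means
    no outbound traffic is allowed for pods matching the selector.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(pod_labels)},
            "policyTypes": ["Egress"],
            "egress": [],
        },
    }


# Hypothetical namespace and label; adjust to match your GPU workloads.
manifest = deny_egress_policy("ml-training", {"gpu-workload": "true"})
print(json.dumps(manifest, indent=2))
```

Generating the manifest programmatically makes it easy to emit one policy per namespace found during the GPU access audit.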
Patch Details
NVIDIA released patched driver versions 550.40.07, 535.154.05, 525.147.05, and 470.223.02 to address this vulnerability.
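The version comparison can be scripted for fleet-wide checks. This is a minimal sketch under two assumptions: the patched versions are exactly those listed above, and a driver's release branch can be identified by the first component of its version string. The helper names (`check_driver`, `installed_driver_version`) are illustrative. The installed version is read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`; a driver from a branch not listed here is reported as `unknown-branch` and should be checked against the NVIDIA security bulletin by hand.

```python
import subprocess

# Patched versions as listed in this advisory, one per release branch.
PATCHED = ["550.40.07", "535.154.05", "525.147.05", "470.223.02"]


def parse(version):
    """Turn '535.154.05' into (535, 154, 5) for numeric comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def check_driver(installed, patched=PATCHED):
    """Return 'patched', 'vulnerable', or 'unknown-branch'.

    Assumes the branch is identified by the major version number.
    """
    inst = parse(installed)
    for fixed in patched:
        fix = parse(fixed)
        if fix[0] == inst[0]:
            return "patched" if inst >= fix else "vulnerable"
    return "unknown-branch"


def installed_driver_version():
    """Read the driver version from nvidia-smi (first GPU's line)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()


# Example with hard-coded versions; on a GPU host, pass
# installed_driver_version() instead.
for version in ["535.129.03", "535.154.05", "390.157"]:
    print(version, "->", check_driver(version))
```

Versions are compared as tuples of integers rather than as strings, so that a component such as `54` correctly sorts above `7` and leading zeros do not matter.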