NVIDIA GPU Driver Use-After-Free Allows Privilege Escalation from Containerized Workloads
Overview
A high-severity use-after-free vulnerability was discovered in the kernel-mode layer of the NVIDIA GPU driver for Linux. The flaw, tracked as CVE-2024-0072, can be triggered by a local, low-privileged user through specially crafted interactions with the driver's memory management components. The primary impact is in multi-tenant GPU environments, which are common in cloud AI services and on-premises Kubernetes clusters that use GPU sharing or MIG (Multi-Instance GPU). An attacker who has compromised a single container with GPU access can exploit this vulnerability to cause a denial of service (crashing the host node) or, more critically, to achieve arbitrary code execution in the host's kernel. A successful exploit would let the attacker break out of the container sandbox, gain root-level privileges on the underlying host node, and then access or disrupt every other containerized workload running on that node. This could lead to the theft of sensitive data, AI models, or credentials belonging to other tenants, effectively collapsing the security isolation that containerization is meant to provide. The vulnerability underscores the importance of keeping the entire infrastructure stack in AI/ML environments patched, from the hardware drivers up to the application layer.
Affected Systems
Testing Guide
1. **Check Driver Version:** On the GPU host node, run the command `nvidia-smi`.
2. **Verify Version:** Compare the 'Driver Version' reported in the output with the list of affected versions. If your version is lower than the patched version for your driver branch (e.g., you have 535.150.00, and the patched 535.x release is 535.161.07), you are vulnerable.
3. **Consult NVIDIA Bulletin:** Cross-reference your driver branch (e.g., 535.x, 550.x) with the official NVIDIA Security Bulletin for CVE-2024-0072 to confirm.
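The version comparison in the steps above can be automated. The following is a minimal sketch, not an official tool: it assumes the patched versions listed under Patch Details are the per-branch thresholds, and the names `PATCHED`, `parse_version`, `assess`, and `installed_driver_versions` are hypothetical helpers introduced for illustration. Branches not covered by that list are reported as `unknown` and should be checked against the NVIDIA bulletin.

```python
import subprocess

# Assumption: per-branch patched versions, copied from the Patch Details section.
PATCHED = {
    550: (550, 54, 14),
    545: (545, 29, 6),
    535: (535, 161, 7),
}


def parse_version(text):
    """Turn a string such as '535.150.00' into a tuple such as (535, 150, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def assess(version_text):
    """Return 'patched', 'vulnerable', or 'unknown' for one driver version."""
    version = parse_version(version_text)
    patched = PATCHED.get(version[0])
    if patched is None:
        # Branch not in the list above: consult the vendor bulletin.
        return "unknown"
    return "patched" if version >= patched else "vulnerable"


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


if __name__ == "__main__":
    for version in installed_driver_versions():
        print(version, assess(version))
```

Run on a GPU host, the script prints one line per distinct driver version. The `assess` function can also be called directly on a version string collected by other means, such as a fleet inventory.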
Mitigation Steps
1. **Update GPU Drivers:** Immediately update all affected NVIDIA drivers on host machines to the patched versions specified in the NVIDIA security bulletin.
2. **Implement Node Isolation:** In Kubernetes, use taints and tolerations to isolate workloads with different trust levels onto separate physical nodes.
3. **Use Secure Runtimes:** Employ security-hardened container runtimes like gVisor or Kata Containers, which provide an additional layer of kernel isolation between the container and the host.
4. **Monitor Kernel Logs:** Actively monitor host kernel logs (`dmesg`) for signs of driver instability or crashes, which could indicate attempted exploitation.
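As an illustration of the log-monitoring step, the sketch below filters kernel log text for lines that often accompany driver faults. The pattern list is an assumption chosen for illustration, not a signature for this CVE: `NVRM: Xid` is the prefix the NVIDIA driver uses for GPU error reports, and the other patterns are generic kernel fault messages. The names `SUSPICIOUS` and `suspicious_lines` are hypothetical.

```python
import re
import sys

# Assumption: illustrative patterns only; tune them for your environment.
SUSPICIOUS = [
    re.compile(r"NVRM: Xid"),                # NVIDIA driver error reports
    re.compile(r"KASAN: .*use-after-free"),  # only on kernels built with KASAN
    re.compile(r"general protection fault"),
    re.compile(r"BUG: unable to handle"),
]


def suspicious_lines(log_text):
    """Return the log lines that match at least one pattern."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPICIOUS)
    ]


if __name__ == "__main__":
    # Read kernel log text from standard input and print the matching lines.
    for line in suspicious_lines(sys.stdin.read()):
        print(line)
```

If the script is saved as, say, `scan_dmesg.py` (a hypothetical filename), it can be fed with `dmesg | python3 scan_dmesg.py`. A match is a prompt for investigation rather than proof of exploitation, since ordinary hardware or driver faults produce the same messages.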
Patch Details
Patches are available in NVIDIA driver versions 550.54.14, 545.29.06, 535.161.07 and later.