Privilege Escalation in NVIDIA GPU Driver Allows Container Escape in Multi-Tenant AI Workloads
Overview
A high-severity vulnerability in the NVIDIA GPU driver for Linux enables privilege escalation from a containerized environment. The flaw, tracked as CVE-2025-28431, is an improper access control issue in the driver's kernel-mode layer. It poses a significant risk in cloud environments that use GPU sharing technologies such as NVIDIA's Multi-Instance GPU (MIG).

An attacker with access to a container on a shared GPU can send a specially crafted IOCTL call to the NVIDIA driver. The call exploits the access control flaw and lets the attacker write to arbitrary kernel memory. With that write primitive, the attacker can break out of the container's isolation, gain full root privileges on the host operating system, and potentially access or disrupt the workloads of every other tenant sharing the same physical GPU. This compromises the confidentiality and integrity of all AI models and data processed on the affected node.

Cloud providers and operators of on-premises Kubernetes clusters that run multi-tenant AI workloads are urged to patch immediately to prevent cross-tenant attacks.
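Affected Systems
Linux hosts running an NVIDIA GPU driver older than the patched release, 550.40.10, in particular nodes that share a GPU across containers. See the NVIDIA security bulletin for the full list of affected versions.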
Testing Guide
1. **Check Driver Version:** On the host system, run `nvidia-smi` to check the installed driver version. Compare this with the patched version number listed in the security advisory.
2. **Use Vulnerability Scanners:** Run an infrastructure vulnerability scanner (e.g., Trivy, Qualys) against your host systems and container images to detect outdated and vulnerable driver components.
3. **Review Security Contexts:** Audit your Kubernetes pod security policies or security context constraints to ensure that GPU-enabled pods are not running with unnecessary privileges.
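The driver version check in step 1 can be scripted across a fleet. The following is a minimal sketch, not a vendor tool: it assumes `nvidia-smi` is on the host's `PATH`, and the helper names (`parse_version`, `is_patched`, `installed_driver_versions`) are illustrative. It also assumes that a plain numeric comparison against 550.40.10 is sufficient; if other driver branches have their own patched releases, the security bulletin takes precedence over this comparison.

```python
import subprocess

# Patched release named in this advisory's Patch Details section.
PATCHED_VERSION = (550, 40, 10)


def parse_version(text):
    """Convert a dotted version string such as '550.40.07' to a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(version_text, patched=PATCHED_VERSION):
    """Return True if the version is numerically at or above the patched release.

    Assumption: a simple tuple comparison is adequate for the hosts being checked.
    """
    return parse_version(version_text) >= patched


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU on this host."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    # One line per GPU; de-duplicate because all GPUs share one driver.
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})
```

On a GPU host, calling `is_patched(v)` for each `v` in `installed_driver_versions()` flags nodes that still need the update. The comparison helpers do not need a GPU, so they can also be run against version strings collected by an inventory system.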
Mitigation Steps
1. **Update NVIDIA Drivers:** Immediately update all host systems to the patched NVIDIA driver version (550.40.10 or newer) as specified in the NVIDIA security bulletin.
2. **Use Secure Runtimes:** Employ container runtimes with stronger sandboxing capabilities, such as gVisor or Kata Containers, to provide an additional layer of defense-in-depth.
3. **Limit Privileged Containers:** Disallow the use of privileged containers (`--privileged` flag in Docker) for ML workloads unless absolutely necessary, and use security contexts to restrict capabilities.
4. **Monitor Kernel Logs:** Actively monitor host kernel logs for anomalous activities related to the GPU driver.
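To support the guidance on privileged containers, a pod manifest can be checked mechanically before it is admitted. The sketch below is illustrative only: it assumes the manifest has already been parsed into a Python dict (for example, from `kubectl get pod -o json`), that GPU containers request resources under the `nvidia.com/` prefix, and that the capability list in `RISKY_CAPS` is a hypothetical starting point rather than an official policy. The function name `risky_gpu_containers` is likewise an assumption.

```python
# Illustrative, non-exhaustive set of capabilities worth flagging on GPU pods.
RISKY_CAPS = {"SYS_ADMIN", "SYS_MODULE", "SYS_RAWIO", "SYS_PTRACE"}


def risky_gpu_containers(pod):
    """Return (container name, reason) pairs for GPU containers with risky settings."""
    findings = []
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        limits = (container.get("resources") or {}).get("limits") or {}
        # Only inspect containers that request an NVIDIA device resource.
        if not any(key.startswith("nvidia.com/") for key in limits):
            continue
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            findings.append((name, "privileged"))
        # Treat an unset field as not disabled.
        if ctx.get("allowPrivilegeEscalation", True):
            findings.append((name, "allowPrivilegeEscalation not set to false"))
        for cap in (ctx.get("capabilities") or {}).get("add") or []:
            if cap in RISKY_CAPS:
                findings.append((name, f"adds capability {cap}"))
    return findings


# Example manifest: one GPU container with risky settings, one non-GPU sidecar.
example_pod = {
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "resources": {"limits": {"nvidia.com/gpu": 1}},
                "securityContext": {
                    "privileged": True,
                    "capabilities": {"add": ["SYS_ADMIN"]},
                },
            },
            {"name": "sidecar", "securityContext": {"privileged": True}},
        ]
    }
}

for container_name, reason in risky_gpu_containers(example_pod):
    print(f"{container_name}: {reason}")
```

The check deliberately skips containers that do not request a GPU, since this advisory concerns GPU-enabled pods; a broader audit would drop that filter. A check like this complements, but does not replace, cluster-level admission controls.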
Patch Details
Patched in NVIDIA Linux driver version 550.40.10. See NVIDIA Security Bulletin 5594.