NVIDIA GPU Driver Use-After-Free Flaw Allows Privilege Escalation in Multi-Tenant ML Environments
Overview
A high-severity use-after-free vulnerability was identified in the NVIDIA GPU display driver for Linux, posing a significant risk to multi-tenant AI/ML infrastructure. The flaw, tracked as CVE-2025-10777, exists in the kernel-mode driver's memory management component responsible for handling user-mode submissions.

A local, unprivileged attacker with access to the GPU (a common scenario in shared Jupyter notebook environments or Kubernetes clusters with GPU sharing) can craft a specific sequence of API calls that causes the driver to release a memory object while still retaining a pointer to it. The attacker can then re-allocate memory in a controlled manner so that the freed space is occupied by a malicious payload. When the driver later dereferences the dangling pointer, it operates on attacker-controlled data, which leads to execution of the attacker's shellcode with kernel-level privileges.

A successful exploit allows the attacker to escape their container or user session and gain full root access to the underlying host node. This completely breaks tenant isolation, enabling the attacker to access other users' data, steal models, or compromise the entire training cluster.
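The free-then-reuse pattern described above can be illustrated with a toy model. The sketch below is not driver code and does not touch a GPU; `ToyPool` and `demo` are hypothetical names, and the pool simply assumes an allocator that hands out the most recently freed slot first. It only shows why a retained handle ends up pointing at someone else's data.

```python
class ToyPool:
    """Toy fixed-size slot pool that reuses the most recently freed slot first."""

    def __init__(self, size):
        self.slots = [None] * size
        # Reversed so that pop() hands out slot 0 first.
        self.free_list = list(range(size - 1, -1, -1))

    def alloc(self, value):
        index = self.free_list.pop()
        self.slots[index] = value
        return index

    def free(self, index):
        self.slots[index] = None
        self.free_list.append(index)

    def read(self, index):
        return self.slots[index]


def demo():
    pool = ToyPool(4)

    # The "driver" allocates an object and keeps a handle to it.
    driver_handle = pool.alloc("driver object")

    # Bug being modeled: the object is released, but the handle is not cleared.
    pool.free(driver_handle)

    # The "attacker" allocates next and receives the slot that was just freed.
    attacker_handle = pool.alloc("attacker-controlled data")

    # The stale handle now aliases the attacker's allocation.
    return driver_handle == attacker_handle, pool.read(driver_handle)


if __name__ == "__main__":
    print(demo())
```

In a real kernel allocator the reuse is not this deterministic, which is why the attacker has to shape allocations "in a controlled manner" as described above.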
Affected Systems
Testing Guide
1. **Check Driver Version:** On a Linux host with an NVIDIA GPU, run `nvidia-smi` to display the installed driver version.
2. **Compare with Bulletin:** Cross-reference the installed version with the 'Affected Versions' list in the official NVIDIA security bulletin for the corresponding CVE.
3. **Run Exploit PoC:** In a controlled, non-production environment, attempt to run a public Proof-of-Concept (PoC) exploit for the CVE. A successful exploit will typically result in a root shell (`#`) being spawned from an unprivileged user account.
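Steps 1 and 2 can be partly automated across a fleet. The following is a minimal sketch, assuming the first fixed version for a host's driver branch has been read from the NVIDIA security bulletin and passed in as `minimum_patched`; no real fixed version is given here. The helper names are hypothetical. The comparison is a plain numeric one, so it should only be applied within a single driver branch.

```python
import subprocess


def parse_driver_version(text):
    """Turn a version string such as '550.54.14' into (550, 54, 14)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(installed, minimum_patched):
    """True if `installed` is at or above `minimum_patched` (same branch assumed)."""
    return parse_driver_version(installed) >= parse_driver_version(minimum_patched)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU on this host."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # One line per GPU; deduplicate because all GPUs normally share one driver.
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


if __name__ == "__main__":
    # Placeholder value: replace with the fixed version listed in the bulletin.
    minimum_patched = "0.0.0"
    for version in installed_driver_versions():
        status = "ok" if is_patched(version, minimum_patched) else "NEEDS UPDATE"
        print(f"driver {version}: {status}")
```

A check like this only confirms the version number; it does not replace the bulletin comparison for hosts that run several driver branches.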
Mitigation Steps
1. **Update Drivers:** Immediately update all affected host systems to the latest patched NVIDIA driver version as specified in the NVIDIA security bulletin.
2. **Isolate Workloads:** Do not rely solely on containerization for security. Use stronger isolation mechanisms like virtual machines (VMs) or bare-metal separation for untrusted workloads.
3. **Restrict GPU Access:** In multi-tenant environments, use Kubernetes device plugins and policies to strictly control which pods can access GPU resources. Avoid sharing a single physical GPU across multiple untrusted tenants using MIG (Multi-Instance GPU) unless all security patches are applied.
4. **Monitor for Anomalies:** Use system monitoring tools to detect anomalous GPU memory usage or unexpected kernel-level activity.
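As one way to support step 3, GPU requests can be audited against a namespace allowlist. The sketch below is an assumption-laden illustration, not a policy engine: it works on pod objects already loaded as dictionaries (for example from the JSON printed by `kubectl get pods -A -o json`), it assumes GPUs are exposed under the `nvidia.com/gpu` resource name used by the NVIDIA device plugin, and the function names and allowlist are hypothetical. It inspects regular containers only, not init containers.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name assumed from the NVIDIA device plugin


def requests_gpu(pod):
    """True if any regular container in the pod asks for the GPU resource."""
    containers = pod.get("spec", {}).get("containers") or []
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("limits", "requests"):
            if GPU_RESOURCE in (resources.get(section) or {}):
                return True
    return False


def unauthorized_gpu_pods(pods, allowed_namespaces):
    """Return 'namespace/name' for GPU pods outside the allowlisted namespaces."""
    flagged = []
    for pod in pods:
        metadata = pod.get("metadata", {})
        namespace = metadata.get("namespace", "default")
        if requests_gpu(pod) and namespace not in allowed_namespaces:
            flagged.append(f"{namespace}/{metadata.get('name', '<unnamed>')}")
    return flagged


if __name__ == "__main__":
    # Hypothetical sample data standing in for real cluster output.
    sample_pods = [
        {
            "metadata": {"namespace": "ml-training", "name": "trainer-0"},
            "spec": {"containers": [{"resources": {"limits": {GPU_RESOURCE: "1"}}}]},
        },
        {
            "metadata": {"namespace": "sandbox", "name": "notebook-7"},
            "spec": {"containers": [{"resources": {"limits": {GPU_RESOURCE: "1"}}}]},
        },
    ]
    print(unauthorized_gpu_pods(sample_pods, allowed_namespaces={"ml-training"}))
```

An audit of this kind reports violations after the fact; preventing them still requires admission policies or resource quotas enforced by the cluster.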
Patch Details
NVIDIA has released security updates addressing this issue. Patches are available in driver branches R555 and later; consult the NVIDIA security bulletin for the exact patched version.