NVIDIA Driver Use-After-Free Allows Container Escape in GPU-Accelerated Kubernetes
Overview
A use-after-free vulnerability was discovered in the NVIDIA kernel-mode driver for Linux. A malicious container running a GPU-accelerated workload in a shared Kubernetes environment, a common setup for multi-tenant AI platforms, can trigger the flaw. The vulnerability lies in the driver's memory management unit (MMU) notifier subsystem and is reached through specific sequences of CUDA API calls related to memory unmapping and resource deallocation. An attacker who can execute code inside a container with GPU access can craft a program that triggers this condition, causing the driver to reference freed kernel memory. Successful exploitation allows the attacker to write arbitrary data to this memory, leading to a kernel panic (denial of service) or, with more advanced techniques, privilege escalation and arbitrary code execution in the host kernel. This breaks the container isolation boundary: the attacker can compromise the entire node and access or disrupt every other container running on it, including those of other tenants. The vulnerability poses a significant risk to cloud providers and enterprises that use GPUs for shared ML training and inference.
Affected Systems
Testing Guide
1. Identify the NVIDIA driver version on your Kubernetes nodes by running `nvidia-smi` on the host.
2. Compare the installed version against the patched versions listed in the official NVIDIA security bulletin for this vulnerability.
3. In a non-production environment, security teams can use the proof-of-concept (PoC) code released by the discovering researchers to attempt to trigger the kernel panic on a vulnerable node. A successful crash confirms the vulnerability.
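Steps 1 and 2 can be scripted. The following is a minimal sketch, not an official check: it assumes that the driver's major version number identifies its release branch, that versions within a branch compare component by component, and that the patched versions are the two named under Patch Details. The function names are illustrative. Branches not listed there return `None`, in which case the NVIDIA security bulletin remains the authority.

```python
import subprocess
from typing import Optional, Tuple

# Patched versions from the "Patch Details" section, keyed by driver branch.
# Assumption: the major version number identifies the branch.
PATCHED = {550: (550, 99, 1), 555: (555, 77, 2)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '550.54.15' into (550, 54, 15)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(installed: str) -> Optional[bool]:
    """Return True/False for a known branch, None if the branch is not listed."""
    version = parse_version(installed)
    fixed = PATCHED.get(version[0])
    if fixed is None:
        return None  # Branch not covered here; consult the security bulletin.
    return version >= fixed


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version (run on the host, not in a pod)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    # nvidia-smi prints one line per GPU; all GPUs share the same driver.
    return result.stdout.splitlines()[0].strip()
```

On a GPU node, `is_patched(installed_driver_version())` gives the verdict for that host. A result of `False` or `None` means the node should be treated as potentially vulnerable until checked against the bulletin.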
Mitigation Steps
1. **Update Drivers:** Immediately update all NVIDIA drivers on host machines to a patched version as specified in the NVIDIA security bulletin.
2. **Use Hardened Runtimes:** For high-security workloads, run GPU containers using stronger isolation technologies like Kata Containers, which run each pod inside a lightweight virtual machine.
3. **Limit GPU Access:** Only grant GPU access to trusted workloads. Use Kubernetes taints, tolerations, and node selectors to schedule untrusted pods on nodes without GPUs.
4. **Kernel Hardening:** Apply kernel hardening best practices on host nodes, such as enabling Kernel Page-Table Isolation (KPTI) and Control-flow Enforcement Technology (CET).
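To support the "Limit GPU Access" step, the sketch below audits which GPU nodes are not protected by a taint, meaning any pod without a matching toleration could still be scheduled onto them. It works on the JSON produced by `kubectl get nodes -o json`. It assumes GPUs are advertised under the `nvidia.com/gpu` resource name used by the NVIDIA device plugin; the taint key checked here is a hypothetical choice and should be replaced with whatever key your cluster uses.

```python
import json

# Resource name under which the NVIDIA device plugin advertises GPUs.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_count(node: dict) -> int:
    """Number of GPUs a node advertises, or 0 if it advertises none."""
    capacity = node.get("status", {}).get("capacity", {})
    return int(capacity.get(GPU_RESOURCE, "0"))


def untainted_gpu_nodes(node_list: dict, taint_key: str = "nvidia.com/gpu") -> list:
    """Names of GPU nodes lacking a NoSchedule/NoExecute taint with taint_key.

    The default taint_key is an assumption, not a Kubernetes requirement.
    """
    flagged = []
    for node in node_list.get("items", []):
        if gpu_count(node) == 0:
            continue  # Not a GPU node; nothing to protect.
        taints = node.get("spec", {}).get("taints") or []
        guarded = any(
            t.get("key") == taint_key
            and t.get("effect") in ("NoSchedule", "NoExecute")
            for t in taints
        )
        if not guarded:
            flagged.append(node["metadata"]["name"])
    return flagged


def load_nodes(path: str) -> dict:
    """Load the output of `kubectl get nodes -o json` saved to a file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, after `kubectl get nodes -o json > nodes.json`, calling `untainted_gpu_nodes(load_nodes("nodes.json"))` lists the GPU nodes that untrusted pods could still land on.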
Patch Details
NVIDIA released patched driver versions 550.99.01 and 555.77.02. The fix addresses the use-after-free condition by adding proper locking and validation checks during memory deallocation.