NVIDIA GPU Driver Memory Corruption Allows Container Escape in Multi-Tenant ML Environments
Overview
A critical use-after-free vulnerability was discovered in the NVIDIA kernel mode driver (nv-kernel.o) for datacenter-class GPUs such as the H100 and A100 series. The flaw resides in the driver's handling of complex memory mapping operations issued through the CUDA API. An unprivileged user inside a Kubernetes pod or Docker container can craft a specific sequence of memory allocation, mapping, and deallocation calls that triggers a race condition. The race leaves the kernel driver holding a stale pointer to a freed memory object. With further API calls, the attacker can manipulate this pointer to write arbitrary data to an attacker-controlled location in kernel memory. A sophisticated exploit can leverage this write primitive to achieve arbitrary kernel code execution on the host operating system.

In a shared, multi-tenant cloud or on-premise GPU cluster, the impact is severe. A malicious tenant can break out of their container and gain full control of the underlying host node. From there, they can access the data and workloads of every other tenant on the same machine, intercept sensitive model data from GPU memory, and potentially pivot to attack the wider cluster control plane.

The vulnerability was discovered by security researchers fuzz testing the CUDA API surface from within sandboxed environments.
Affected Systems
Testing Guide
1. Check your current NVIDIA driver version by running `nvidia-smi` on the host node.
2. Compare the displayed driver version against the patched versions listed in the relevant NVIDIA Security Bulletin.
3. If your version is lower than the patched version for your release branch, you are affected.
4. Do not attempt to trigger the flaw directly. Actively testing for it requires a proof-of-concept exploit, which is typically not released publicly for kernel vulnerabilities and would not be safe to run on a production host.
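The version comparison in steps 1 to 3 can be scripted. The following is a minimal sketch, not an official NVIDIA tool: the function names are hypothetical, the patched versions are copied from the Patch Details section, and it assumes that branches newer than the newest listed branch already include the fix. Branches not listed here return `None`, meaning the NVIDIA Security Bulletin must be consulted.

```python
import subprocess

# Patched versions per release branch, copied from the Patch Details section.
# Other branches may be listed in the NVIDIA Security Bulletin.
PATCHED = {
    535: (535, 154, 1),
    550: (550, 54, 14),
}


def parse_version(text):
    """Turn a string such as '535.154.01' into (535, 154, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(installed):
    """Return True if affected, False if patched, None if unknown.

    Comparison is numeric per component, so '535.54.03' sorts below
    '535.154.01' (a plain string comparison would get this wrong).
    """
    version = parse_version(installed)
    branch = version[0]
    if branch in PATCHED:
        return version < PATCHED[branch]
    if branch > max(PATCHED):
        # Assumption: branches newer than the newest listed one carry the fix.
        return False
    # Branch not covered by this table: check the bulletin.
    return None


def installed_driver_version():
    """Query the driver version via nvidia-smi (run on the host node)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One line is printed per GPU; all GPUs share the same driver.
    return output.splitlines()[0].strip()
```

On a host node, `is_affected(installed_driver_version())` gives the result for the running driver. Treat a `None` result as "not verified" rather than "safe".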
Mitigation Steps
1. **Patch Immediately**: Update all host nodes with NVIDIA GPUs to the latest patched driver version provided by NVIDIA or your cloud provider (for example AWS, GCP, or Azure).
2. **Isolate Sensitive Workloads**: If immediate patching is not possible, run highly sensitive ML workloads on dedicated nodes that are not shared with untrusted tenants.
3. **Use Sandboxing Technologies**: Employ stronger sandboxing technologies such as gVisor or Kata Containers for GPU workloads. These can add an extra layer of kernel isolation and reduce the impact of such vulnerabilities.
4. **Monitor for Anomalous API Calls**: Implement monitoring for unusual patterns of CUDA API calls from containers, which might indicate an exploitation attempt.
Patch Details
Patched in driver versions 535.154.01 and 550.54.14, and in all later releases on those branches.