NVIDIA CUDA Driver Race Condition Allows GPU Memory Access Across Kubernetes Pods
Overview
A high-severity race condition was found in specific versions of the NVIDIA CUDA driver for Linux. The flaw was in the driver's memory management unit (MMU) handling during GPU context switches between processes, and it mattered most in containerized environments such as Kubernetes. Under heavy, concurrent GPU workloads from multiple pods on the same node, an attacker could issue a crafted sequence of CUDA API calls to trigger the race. Successful exploitation let a malicious pod read from or write to small, non-contiguous chunks of GPU memory belonging to other pods on the same physical GPU. Full memory takeover was difficult, but the attack could leak sensitive data being processed by other pods' AI workloads (e.g., inference data, model parameters) or cause denial of service by corrupting another pod's GPU memory and crashing the driver. The vulnerability was particularly concerning for multi-tenant cloud providers offering shared GPU instances, because it broke the primary security boundary between customer workloads.
Affected Systems
Testing Guide
1. Verify the installed NVIDIA driver version on your GPU nodes by running `nvidia-smi`.
2. Compare the installed version against the patched versions listed in the NVIDIA security bulletin.
3. A proof-of-concept exploit is complex and requires specialized code, but you can check for related error messages in the system's kernel log: `dmesg | grep -i "NVRM: Xid"`. Frequent, unexplained Xid errors (especially ones related to memory) could be an indicator.
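Steps 1 and 2 can be scripted across a fleet of nodes. The following is a minimal sketch, assuming the only patched minimums are the two versions named in this advisory (550.90.07 and 555.52.04) and that a driver is compared only against the minimum for its own branch. The names `PATCHED_BY_BRANCH`, `is_patched`, and `installed_driver_versions` are illustrative, not part of any NVIDIA tooling. For branches this advisory does not list, the sketch returns `None` rather than guessing.

```python
import subprocess

# Patched minimums per driver branch, taken from this advisory's text.
# Assumption: other branches are not covered here and need the bulletin.
PATCHED_BY_BRANCH = {550: (550, 90, 7), 555: (555, 52, 4)}


def parse_version(text):
    """Turn a version string such as '550.90.07' into (550, 90, 7)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(version_text):
    """Return True or False for a listed branch, None for an unlisted one."""
    version = parse_version(version_text)
    minimum = PATCHED_BY_BRANCH.get(version[0])
    if minimum is None:
        return None  # branch not listed in this advisory
    return version >= minimum


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU on the node."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Deduplicate, since every GPU on a node normally reports the same driver.
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})
```

On a GPU node, calling `is_patched(v)` for each `v` returned by `installed_driver_versions()` gives a per-node answer. A `None` result means the branch should be checked by hand against the NVIDIA security bulletin.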
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update the NVIDIA drivers on all GPU nodes to a patched version (e.g., 550.90.07 or 555.52.04 and later).
2. **Use GPU Time-Slicing with Caution**: Time-slicing shares one physical GPU among pods without memory isolation between them, so avoid sharing physical GPUs between untrusted tenants where possible. Dedicate full GPUs to pods handling sensitive data.
3. **Monitor GPU Metrics**: Monitor for anomalous GPU behavior, such as unexpected kernel panics or memory access errors, which could indicate an exploitation attempt.
4. **Enable MIG**: For compatible GPUs (e.g., A100, H100), use Multi-Instance GPU (MIG) to create hardware-level isolated GPU instances for each pod, which fully mitigates this attack.
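The monitoring step can be partly automated by counting Xid events in the kernel log. The sketch below assumes log lines carry the `NVRM: Xid (<device>): <code>, ...` prefix that the `dmesg` check in the Testing Guide greps for. The set `MEMORY_XIDS` and the default threshold are illustrative assumptions for this example, not values from the NVIDIA bulletin. Tune both to your environment.

```python
import re
from collections import Counter

# Captures the device identifier and the numeric Xid code from a kernel log line.
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")

# Assumption: Xid codes treated as memory-related for this sketch
# (31 = GPU memory page fault, 48 = double-bit ECC error).
MEMORY_XIDS = {31, 48}


def count_xids(lines):
    """Count Xid events per (device, code) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def memory_alerts(counts, threshold=3):
    """Return (device, code) pairs for memory-related Xids seen at least `threshold` times."""
    return sorted(
        key for key, seen in counts.items()
        if key[1] in MEMORY_XIDS and seen >= threshold
    )
```

One way to use this is to pass the lines of `dmesg` output (or a node's journal) to `count_xids` on a schedule and alert when `memory_alerts` returns a non-empty list. Xid errors also have benign causes such as application bugs, so a hit is a prompt for investigation rather than proof of exploitation.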
Patch Details
NVIDIA released patched driver versions 550.90.07 and 555.52.04, which resolve the race condition by implementing stricter locking during GPU context switches.