NVIDIA CUDA Driver Vulnerability Allows GPU Memory Hijacking in Multi-Tenant AI Cloud Environments
Overview
A critical privilege escalation vulnerability was identified in the NVIDIA CUDA driver, affecting how GPU memory is isolated between processes and containers. The flaw, designated 'GPUSnoop,' allowed a low-privilege process running in one container to read from and write to arbitrary memory locations on a GPU being used by a separate container on the same host machine. This breaks the security model of multi-tenant AI platforms like AWS SageMaker, GCP Vertex AI, and Azure ML.
An attacker could rent a cheap GPU instance, deploy a malicious inference workload, and use the exploit to access the GPU memory of other tenants. This would allow them to steal sensitive data being processed (e.g., proprietary datasets, financial information), capture entire AI models loaded into VRAM, or inject malicious data into another user's training or inference job, effectively poisoning their model.
The root cause was a race condition in the driver's Unified Virtual Memory (UVM) subsystem when handling page table updates. The discovery was made by researchers at the Georgia Institute of Technology and required deep knowledge of GPU architecture and driver internals.
Affected Systems
Testing Guide
1. On a multi-GPU system or a shared host, run `nvidia-smi` to check the installed driver version. If it is below `550.40.10`, you are vulnerable.
2. A specialized proof-of-concept tool is required to test for memory access violations. Refer to the official NVIDIA security bulletin for testing utilities.
3. In a controlled environment, deploy two separate containers requesting fractional GPUs on the same physical device. Attempt to use the PoC from one container to read memory allocated by a process in the second container.
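The driver version check in step 1 can be scripted so it runs the same way on every host. The following is a minimal sketch, not an official NVIDIA utility: it assumes `nvidia-smi` is on the PATH, and the helper names (`parse_version`, `is_vulnerable`, `installed_driver_versions`) are hypothetical. It applies a plain numeric comparison against the single threshold given in this advisory and does not account for fixes that may have been backported to other driver branches.

```python
import subprocess

# First fixed driver release according to this advisory.
PATCHED_VERSION = "550.40.10"


def parse_version(text):
    """Turn '550.40.10' into (550, 40, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(driver_version, patched=PATCHED_VERSION):
    """Return True when driver_version is older than the patched release."""
    return parse_version(driver_version) < parse_version(patched)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU.

    Returns an empty list when nvidia-smi is missing or exits with an error.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    versions = installed_driver_versions()
    if not versions:
        print("nvidia-smi not available; could not determine driver version")
    for version in versions:
        status = "VULNERABLE" if is_vulnerable(version) else "patched"
        print(f"{version}: {status}")
```

Comparing integer tuples rather than raw strings avoids lexicographic mistakes, for example treating `550.9.1` as newer than `550.40.10`.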
Mitigation Steps
1. Update NVIDIA drivers on all host machines to version `550.40.10` or newer.
2. Cloud customers should ensure their cloud provider has patched the underlying host infrastructure. Follow provider-specific guidance for restarting or redeploying instances to receive the patch.
3. For highly sensitive workloads, consider using dedicated, single-tenant GPU instances until the patch is fully deployed and verified.
4. Monitor GPU performance and memory access logs for anomalous patterns that could indicate an exploit attempt.
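This advisory does not name a specific log source for step 4. As one possible starting point, a host operator could periodically list the compute processes attached to each GPU and flag any that are not expected. The sketch below assumes `nvidia-smi` is run on the host; the allowlist and the helper names are hypothetical. It does not detect the exploit itself, only unfamiliar GPU processes that may merit a closer look.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split from both ends so a comma inside the process name is preserved.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        try:
            used_mib = int(mem.strip())
        except ValueError:
            used_mib = None  # non-numeric value such as [N/A]
        rows.append({"pid": int(pid), "name": name.strip(), "used_mib": used_mib})
    return rows


def unexpected_processes(rows, allowed_names):
    """Return rows whose process name is not in the (hypothetical) allowlist."""
    return [row for row in rows if row["name"] not in allowed_names]


def query_compute_apps():
    """Run nvidia-smi on the host; returns '' when it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-compute-apps=pid,process_name,used_memory",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return ""
    return result.stdout


if __name__ == "__main__":
    allowed = {"python3"}  # hypothetical allowlist; adjust per host
    for row in unexpected_processes(parse_compute_apps(query_compute_apps()), allowed):
        print(f"unexpected GPU process: pid={row['pid']} name={row['name']}")
```

Keeping the parsing separate from the `nvidia-smi` call means the filtering logic can be exercised against captured output without access to a GPU.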
Patch Details
The vulnerability is patched in NVIDIA driver version `550.40.10`. Cloud providers including AWS, GCP, and Azure have rolled out patched base images and host updates.