NVIDIA CUDA Driver Integer Overflow Allows Privilege Escalation from ML Containers
Overview
A critical integer overflow vulnerability was discovered in the NVIDIA kernel-mode driver for data center GPUs (e.g., A100 and H100 series). The flaw resides in the driver's ioctl handler for GPU memory management, which is exposed to user-mode processes, including processes inside containers in a Kubernetes environment with GPU passthrough.

An attacker with code execution inside a container that has access to a GPU device (`/dev/nvidia*`) could send the driver a crafted request with large, carefully chosen values for the memory allocation size and offset. These values pass the initial checks but overflow an internal multiplication used to calculate the final buffer size, so the driver allocates a kernel heap buffer much smaller than intended. A subsequent `memcpy`, which uses the original, non-overflowed size, then writes far beyond the end of that buffer and causes a kernel heap overflow.

A skilled attacker could leverage this memory corruption to overwrite function pointers or other critical kernel data structures and gain arbitrary code execution in the context of the host OS kernel. The result is a complete container escape and full compromise of the underlying Kubernetes node.
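The arithmetic behind this class of bug can be modeled in a few lines. The sketch below is not the driver's code: it assumes the size is computed in 32-bit unsigned arithmetic, and the names `count`, `elem_size`, and `MAX_FIELD` are hypothetical stand-ins for the request fields and the initial checks. It only shows how two values that each pass a range check can produce a wrapped, much smaller product.

```python
U32_MASK = 0xFFFFFFFF   # assumption: the size is computed in 32 bits
MAX_FIELD = 0x7FFFFFFF  # hypothetical per-field limit (the "initial checks")


def unchecked_alloc_size(count: int, elem_size: int) -> int:
    """Model the flawed path: each field is checked alone, the product is not."""
    if count > MAX_FIELD or elem_size > MAX_FIELD:
        raise ValueError("field out of range")
    return (count * elem_size) & U32_MASK  # wraps modulo 2**32


def checked_alloc_size(count: int, elem_size: int) -> int:
    """Model a fixed path: reject any product that does not fit in 32 bits."""
    if count > MAX_FIELD or elem_size > MAX_FIELD:
        raise ValueError("field out of range")
    total = count * elem_size
    if total > U32_MASK:
        raise OverflowError("size computation would overflow")
    return total


if __name__ == "__main__":
    count, elem_size = 0x10000001, 0x10  # hypothetical attacker-chosen values
    allocated = unchecked_alloc_size(count, elem_size)
    intended = count * elem_size  # the length a later copy would use
    print(f"allocated {allocated:#x} bytes, copy length {intended:#x} bytes")
```

With these example values the modeled allocation is 16 bytes while the copy length exceeds 4 GiB, which is the mismatch that turns into a heap overflow. The checked variant rejects the same input before any allocation happens.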
Affected Systems
Testing Guide
1. On the host system, check the loaded kernel module version: `cat /proc/driver/nvidia/version`.
2. If the reported version is listed as affected, the system is vulnerable.
3. A proof-of-concept exploit would involve a compiled C program running inside a Docker container (started with `--gpus all`) that opens `/dev/nvidia-uvm` and sends a specially crafted `ioctl` call. A successful (but destructive) test would crash the host machine (kernel panic).

Do not run public PoCs on production systems.
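The version check in step 1 can be scripted for fleet-wide audits. The following is a minimal sketch, assuming the version file contains a line that starts with `NVRM version` and includes a dotted version number, which is the layout this file typically has. The function name `parse_driver_version` is illustrative.

```python
import re
from pathlib import Path
from typing import Optional

VERSION_FILE = Path("/proc/driver/nvidia/version")  # path from step 1


def parse_driver_version(text: str) -> Optional[str]:
    """Return the first dotted version number on the 'NVRM version' line."""
    for line in text.splitlines():
        if line.startswith("NVRM version"):
            # Assumption: the driver version is the first dotted number here.
            match = re.search(r"\b\d+\.\d+(?:\.\d+)?\b", line)
            return match.group(0) if match else None
    return None


if __name__ == "__main__":
    if VERSION_FILE.exists():
        print(parse_driver_version(VERSION_FILE.read_text()))
    else:
        print("NVIDIA kernel module not loaded (no version file)")
```

The script only reads the version file, so it is safe to run on production hosts, unlike the proof-of-concept described in step 3.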
Mitigation Steps
1. **Update Drivers:** Immediately update all NVIDIA drivers on host machines to the patched versions provided by NVIDIA.
2. **Restrict GPU Access:** In multi-tenant environments, use admission controllers and security policies to restrict GPU access to only trusted workloads and namespaces.
3. **Use Secure Runtimes:** Run GPU-accelerated containers using security-hardened runtimes like gVisor or Kata Containers, which can provide an additional layer of isolation between the container and the host kernel.
4. **Monitor Kernel Logs:** Monitor host kernel logs for signs of instability or crashes (kernel panics), which could indicate exploitation attempts.
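The log monitoring in step 4 could start from a simple filter over kernel log text. This is a sketch under stated assumptions: the marker strings are common indicators of kernel faults or NVIDIA driver errors, chosen here for illustration, and the list is neither exhaustive nor specific to this vulnerability. A match means the line deserves a look, not that exploitation occurred.

```python
import re
import sys

# Assumption: illustrative markers of kernel faults or NVIDIA driver errors.
SUSPICIOUS = re.compile(
    r"Kernel panic|general protection fault|BUG:|Oops|NVRM: Xid"
)


def suspicious_lines(log_lines):
    """Yield (line_number, line) for each log line matching a marker."""
    for number, line in enumerate(log_lines, start=1):
        if SUSPICIOUS.search(line):
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Pass one or more saved kernel log files as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, line in suspicious_lines(handle):
                print(f"{path}:{number}: {line}")
```

In practice this kind of filter would feed an existing alerting pipeline rather than run by hand, and the pattern list would be tuned to the noise level of the fleet.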
Patch Details
The fix is included in NVIDIA driver versions 550.90.07 and 535.170.04, and in later releases.
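A version string can be compared against these releases programmatically. The sketch below assumes each listed version is the minimum fixed release for its own branch (550 and 535). Since the advisory does not enumerate other branches, versions from any other branch are reported as `unknown` rather than guessed. The function names and status labels are illustrative.

```python
# Minimum fixed release per driver branch, taken from the versions above.
PATCHED = {550: (550, 90, 7), 535: (535, 170, 4)}


def parse_version(version: str) -> tuple:
    """Turn '535.170.04' into (535, 170, 4) for numeric comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def patch_status(version: str) -> str:
    """Return 'patched', 'unpatched', or 'unknown' for a driver version."""
    parsed = parse_version(version)
    minimum = PATCHED.get(parsed[0])
    if minimum is None:
        return "unknown"  # branch not covered by the versions listed above
    return "patched" if parsed >= minimum else "unpatched"


if __name__ == "__main__":
    for example in ("535.129.03", "550.90.07", "470.42.01"):  # example inputs
        print(example, patch_status(example))
```

Comparing integer tuples instead of raw strings avoids ordering mistakes such as treating `535.54.03` as newer than `535.170.04`.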