Privilege Escalation via Malformed CUDA Kernel in NVIDIA GPU Drivers for Linux
Overview
A critical vulnerability was discovered in the NVIDIA kernel-mode driver for Linux that enables privilege escalation and container escape on multi-tenant AI/ML platforms. The flaw is a use-after-free in the driver's GPU memory management code, and it can be triggered from user space. Any attacker able to run arbitrary code on a GPU could exploit it, for example a user on a shared JupyterHub instance or a malicious ML model pulled from a public repository.
The exploit relies on a custom CUDA kernel, which can be embedded in an otherwise standard PyTorch or TensorFlow model. When the model is loaded and the kernel is launched, it issues a specific sequence of memory allocation and deallocation calls that triggers the use-after-free condition in the `nvidia.ko` driver.
Successful exploitation lets the attacker overwrite kernel memory, leading to arbitrary code execution with kernel-level (Ring 0) privileges on the host operating system. This breaks the isolation between tenants and containers: the attacker can read all data on the host, compromise other users' workloads, and gain persistent control over the underlying infrastructure.
Affected Systems
Linux hosts running the NVIDIA kernel-mode driver (`nvidia.ko`) at a version below 555.48.
Testing Guide
1. **Check Driver Version**: On the host machine, run `nvidia-smi` to check the installed driver version. If it is below 555.48, the system is vulnerable.
2. **Review Workload Sources**: Audit the sources of all ML models and container images running on your GPU infrastructure. Flag any that originate from untrusted public repositories.
3. **Use PoC Tool**: A proof-of-concept tool has been released by the researchers. Run this tool in a controlled, non-production environment to verify if the vulnerability can be triggered.
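The driver version check in step 1 can be scripted across a fleet. The following is a minimal sketch, not an official tool: it assumes `nvidia-smi` is on the `PATH`, that version strings are dot-separated integers (such as `550.54.14`), and that comparing the components numerically against 555.48 matches the advisory's intent. The function names are illustrative.

```python
import subprocess

# First fixed release according to this advisory.
PATCHED_VERSION = (555, 48)


def parse_driver_version(text):
    """Turn a version string such as '550.54.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_vulnerable(version_text, patched=PATCHED_VERSION):
    """Return True when the driver version is older than the patched release."""
    return parse_driver_version(version_text) < patched


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # One line per GPU; drop blank lines.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a host with the driver installed, `any(is_vulnerable(v) for v in installed_driver_versions())` reports whether any GPU is served by an unpatched driver.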
Mitigation Steps
1. **Patch NVIDIA Drivers**: Immediately update all host systems to NVIDIA driver version 555.48 or newer.
2. **Use gVisor or Kata Containers**: Run GPU workloads inside hardened sandboxes like gVisor or Kata Containers that can intercept and filter dangerous syscalls, providing an additional layer of defense.
3. **Restrict Custom Kernels**: On multi-tenant platforms, disallow users from loading and executing arbitrary custom CUDA kernels where possible.
4. **Enable IOMMU**: Ensure that the Input-Output Memory Management Unit (IOMMU), Intel VT-d or AMD-Vi depending on the platform, is enabled and configured correctly in the BIOS/UEFI and OS to provide memory isolation between devices.
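The IOMMU step can be spot-checked from the operating system. The sketch below assumes a Linux host where an active IOMMU populates `/sys/kernel/iommu_groups` and where IOMMU settings, if set explicitly, appear as `intel_iommu=`, `amd_iommu=`, or `iommu=` parameters on the kernel command line. The function names are illustrative, and a positive result shows only that the IOMMU is active, not that it is configured correctly for a given deployment.

```python
from pathlib import Path


def iommu_groups_present(groups_dir="/sys/kernel/iommu_groups"):
    """Return True if the kernel has populated at least one IOMMU group."""
    path = Path(groups_dir)
    return path.is_dir() and any(path.iterdir())


def iommu_cmdline_flags(cmdline):
    """Return the IOMMU-related parameters found in a kernel command line."""
    prefixes = ("intel_iommu=", "amd_iommu=", "iommu=")
    return [token for token in cmdline.split() if token.startswith(prefixes)]
```

For example, `iommu_cmdline_flags(Path("/proc/cmdline").read_text())` lists the parameters the running kernel was booted with, and `iommu_groups_present()` indicates whether the kernel actually created IOMMU groups.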
Patch Details
NVIDIA released driver version 555.48, which addresses the use-after-free vulnerability by adding proper locking and validation checks for memory operations.
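To illustrate the general idea of "locking and validation checks", here is a toy user-space sketch. It is not NVIDIA's patch and does not reflect the driver's internals. The hypothetical `BufferTable` class shows the pattern in miniature: every operation holds a lock, and every handle is validated before use, so a freed buffer cannot be written through a stale handle.

```python
import threading


class BufferTable:
    """Toy handle table: each operation takes a lock and validates the handle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buffers = {}      # handle -> bytearray
        self._next_handle = 1   # handles are never reused in this sketch

    def alloc(self, size):
        """Allocate a zeroed buffer and return its handle."""
        with self._lock:
            handle = self._next_handle
            self._next_handle += 1
            self._buffers[handle] = bytearray(size)
            return handle

    def free(self, handle):
        """Release a buffer; a stale or unknown handle is rejected."""
        with self._lock:
            if handle not in self._buffers:
                raise KeyError(f"invalid or already freed handle: {handle}")
            del self._buffers[handle]

    def write(self, handle, offset, data):
        """Write into a live buffer after validating handle and bounds."""
        with self._lock:
            buf = self._buffers.get(handle)
            if buf is None:
                raise KeyError(f"invalid or already freed handle: {handle}")
            if offset < 0 or offset + len(data) > len(buf):
                raise ValueError("write out of bounds")
            buf[offset:offset + len(data)] = data
```

In this sketch, a write through a handle that has already been freed raises `KeyError` instead of touching memory that may have been reassigned, which is the class of misuse a use-after-free fix has to rule out.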