NVIDIA GPU Driver Use-After-Free Allows Privilege Escalation in Multi-Tenant ML Clusters
Overview
A high-severity use-after-free vulnerability was identified in the kernel-mode layer of the NVIDIA GPU driver. A user-mode client can trigger the flaw by sending a specially crafted sequence of inputs to the driver, producing a race condition in which a memory object is freed while a pointer to it is retained and later used. An attacker with local, low-privilege access can exploit this to write to arbitrary kernel memory.

In AI/ML infrastructure, this vulnerability is particularly dangerous in multi-tenant environments such as Kubernetes clusters that use GPU sharing or MIG (Multi-Instance GPU). A malicious user or a compromised container with access to a GPU slice could exploit it to escalate privileges on the host node. A successful exploit could allow the attacker to break out of the container, gain root access on the host, and then access or disrupt every other ML workload running on that node, including sensitive data and models belonging to other tenants.

The impact ranges from denial of service (crashing the host node) to complete system compromise and loss of workload isolation. The vulnerability underscores the importance of securing the lowest levels of the AI infrastructure stack, because flaws in hardware drivers can undermine all higher-level security controls.
Affected Systems
Testing Guide
1. Check the currently installed NVIDIA driver version on your Linux host using the `nvidia-smi` command.
2. Compare the reported driver version with the patched versions listed in the NVIDIA Security Bulletin for CVE-2024-0073.
3. If your version is lower than the patched version (e.g., less than 550.54.14 for the 550.x series), the system is vulnerable and should be updated immediately.
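
The comparison in steps 2 and 3 can be automated. The following is a minimal sketch, not an official tool: the `PATCHED` table only contains the three versions named in this advisory, and the helper names (`parse_version`, `is_patched`, `installed_driver_version`) are illustrative. It assumes `nvidia-smi` is on the `PATH` and that driver versions are dot-separated integers. Branches not listed in the table are reported as unknown rather than guessed.

```python
import subprocess
from typing import Optional, Tuple

# Minimum patched release per driver branch, copied from this advisory.
# Other branches are deliberately absent; check the NVIDIA bulletin for them.
PATCHED = {
    550: (550, 54, 14),
    545: (545, 29, 6),
    535: (535, 154, 5),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '535.129.03' into (535, 129, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_patched(version: str) -> Optional[bool]:
    """Return True or False for a listed branch, None if the branch is unlisted."""
    parsed = parse_version(version)
    minimum = PATCHED.get(parsed[0])
    if minimum is None:
        return None  # branch not covered by the table above
    return parsed >= minimum


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version (first GPU's line is enough)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a GPU host, `is_patched(installed_driver_version())` returns `False` for a vulnerable driver in a listed branch. A result of `None` means the branch is not in the table and the bulletin should be consulted directly.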
Mitigation Steps
1. **Update NVIDIA Drivers**: Immediately update all GPU drivers on host machines to the patched versions specified in the NVIDIA security bulletin.
2. **Restrict GPU Access**: In multi-tenant environments, use mandatory access control mechanisms such as AppArmor or SELinux profiles to restrict how containers can interact with the GPU driver.
3. **Use Sandboxed Runtimes**: Run untrusted workloads in sandboxed container runtimes such as gVisor or Kata Containers, which add an extra layer of kernel isolation.
4. **Monitor for Anomalies**: Monitor host nodes for anomalous driver interactions or kernel-level crashes, which could indicate an exploitation attempt.
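
As a rough illustration of the monitoring step, the sketch below filters kernel log lines for messages that often accompany driver faults or memory-corruption crashes. The pattern list is an assumption chosen for illustration, not a signature for this vulnerability: it matches NVIDIA driver Xid error reports, KASAN use-after-free reports (only present on kernels built with KASAN), and generic kernel fault messages. The name `suspicious_lines` is hypothetical, and a match is only a prompt for investigation.

```python
import re
from typing import Iterable, List

# Illustrative patterns only; tune them for your environment.
SUSPICIOUS = [
    re.compile(r"NVRM: Xid"),                        # NVIDIA driver error report
    re.compile(r"KASAN: (slab-)?use-after-free"),    # kernel address sanitizer
    re.compile(r"general protection fault"),
    re.compile(r"kernel BUG at"),
]


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the patterns above."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in SUSPICIOUS)
    ]
```

The function takes any iterable of strings, so it can be fed the captured output of `dmesg` or `journalctl -k`, or an open log file. In a cluster, a per-node agent would typically forward matches to the existing alerting pipeline.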
Patch Details
Patched in NVIDIA driver versions 550.54.14, 545.29.06, and 535.154.05, and in newer releases.