NVIDIA DCGM Privilege Escalation in GPU-Accelerated Kubernetes Clusters
Overview
A high-severity privilege escalation vulnerability was identified in NVIDIA's Data Center GPU Manager (DCGM), affecting multi-tenant Kubernetes clusters used for AI/ML workloads. The vulnerability, tracked as CVE-2023-25515, resides in the DCGM daemonset that runs on each GPU-enabled node. The root cause is improper input validation on messages received over a UNIX socket: a flaw in the inter-process communication (IPC) mechanism allows a low-privileged user within a container to send a specially crafted message to the DCGM process running on the host. Successful exploitation lets the attacker execute arbitrary code with root privileges on the underlying host node.

In a typical AI platform environment, a malicious data scientist or a compromised ML training job can use this to break out of its container sandbox and gain full control of the host. From there, the attacker can compromise all other pods on that node (including those of other tenants), access sensitive data and models, and potentially pivot to other parts of the network. The vulnerability is particularly critical because GPU-accelerated clusters are often shared resources and the exploit requires no special capabilities within the source container, making it a significant threat to the security and isolation of shared AI infrastructure.
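The advisory attributes the flaw to improper input validation on messages received over a UNIX socket. DCGM's actual wire format is not described here, so the following is only a generic sketch of what strict validation of an untrusted IPC message can look like. The header layout, `KNOWN_TYPES`, and `MAX_PAYLOAD` are hypothetical and chosen purely for illustration.

```python
import struct

# Hypothetical framing, NOT DCGM's real protocol:
# 2-byte message type + 4-byte payload length, network byte order.
HEADER = struct.Struct("!HI")
KNOWN_TYPES = {1, 2, 3}   # hypothetical message type IDs
MAX_PAYLOAD = 4096        # hypothetical upper bound on payload size


def parse_message(data: bytes):
    """Validate an untrusted message before any field is acted on."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    msg_type, length = HEADER.unpack_from(data)
    if msg_type not in KNOWN_TYPES:
        raise ValueError(f"unknown message type {msg_type}")
    if length > MAX_PAYLOAD:
        raise ValueError(f"declared length {length} exceeds limit")
    payload = data[HEADER.size:]
    # Never trust the declared length: it must match the bytes received.
    if len(payload) != length:
        raise ValueError("declared length does not match payload size")
    return msg_type, payload
```

The point of the sketch is that every field supplied by the sender (type, length, payload) is checked against an explicit bound before the privileged process uses it.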
Affected Systems
GPU-enabled nodes running DCGM versions prior to 3.1.8, including Kubernetes clusters that deploy DCGM through NVIDIA GPU Operator versions prior to v22.9.2.
Testing Guide
1. Check the version of DCGM running on your GPU nodes by exec-ing into a DCGM pod and running `dcgmi --version`.
2. If the version is prior to 3.1.8, the system is vulnerable.
3. Alternatively, check the version of the `gpu-operator` chart or image tag being used in your Kubernetes cluster. If it is prior to v22.9.2, you are likely affected.
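The version comparison in the steps above can be scripted. The following is a minimal sketch that assumes the text you feed it (for example the output of `dcgmi --version` or an image tag) contains a dotted `MAJOR.MINOR.PATCH` string; the exact output format of those tools is not assumed beyond that, and the helper names are illustrative.

```python
import re

# Fixed versions as stated in this advisory.
DCGM_FIXED = (3, 1, 8)
GPU_OPERATOR_FIXED = (22, 9, 2)


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple found in the text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_vulnerable(version_text, fixed):
    """Return True if the reported version is older than the fixed version."""
    return parse_version(version_text) < fixed


if __name__ == "__main__":
    # Example inputs; replace with the strings reported by your cluster.
    print(is_vulnerable("3.1.7", DCGM_FIXED))
    print(is_vulnerable("v22.9.2", GPU_OPERATOR_FIXED))
```

Comparing integer tuples rather than raw strings matters here: as strings, `"3.10.2"` sorts before `"3.1.8"`, which would wrongly flag a newer release as vulnerable.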
Mitigation Steps
1. **Update NVIDIA Drivers and DCGM:** Upgrade DCGM to version 3.1.8 or later, and bring the host drivers up to date alongside it.
2. **Update GPU Operator:** If using Kubernetes, update the NVIDIA GPU Operator to version v22.9.2 or later, which includes the patched DCGM version.
3. **Restrict Node Access:** Limit `exec` access to pods running on GPU nodes to only trusted administrators.
4. **Use gVisor or Kata Containers:** For workloads requiring strong isolation, consider running them inside runtime sandboxes like gVisor or Kata Containers to provide an additional layer of kernel-level isolation.
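For the `exec` restriction, one way to find out who currently holds that access is to audit RBAC rules for the `pods/exec` subresource. The following is a rough sketch, assuming the Role or ClusterRole has already been loaded into a Python dict (for example from `kubectl get role <name> -o json`). The function names are illustrative, and the check covers only the `create` verb and the wildcard forms shown in the comments.

```python
def rule_grants_exec(rule):
    """Return True if an RBAC rule allows 'create' on the 'pods/exec' subresource."""
    api_groups = rule.get("apiGroups", [])
    resources = rule.get("resources", [])
    verbs = rule.get("verbs", [])
    # Pods live in the core API group, written as the empty string.
    group_ok = "" in api_groups or "*" in api_groups
    # "*" matches every resource; "*/exec" matches the exec subresource of any resource.
    resource_ok = any(r in ("pods/exec", "*", "*/exec") for r in resources)
    verb_ok = "create" in verbs or "*" in verbs
    return group_ok and resource_ok and verb_ok


def role_grants_exec(role):
    """Return True if any rule in a Role/ClusterRole dict grants pod exec."""
    return any(rule_grants_exec(rule) for rule in role.get("rules") or [])


if __name__ == "__main__":
    example_role = {
        "kind": "Role",
        "metadata": {"name": "ml-user", "namespace": "training"},  # hypothetical names
        "rules": [
            {"apiGroups": [""], "resources": ["pods", "pods/log"], "verbs": ["get", "list"]},
            {"apiGroups": [""], "resources": ["pods/exec"], "verbs": ["create"]},
        ],
    }
    print(role_grants_exec(example_role))
```

Because RBAC is additive, there is no deny rule to add. Restricting access means removing `pods/exec` from every role bound to untrusted users, which is why the sketch focuses on locating the grants.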
Patch Details
Patched in DCGM 3.1.8 and incorporated into GPU Operator v22.9.2.