NVIDIA DCGM Integer Overflow Allows Privilege Escalation on GPU Nodes
Overview
A high-severity vulnerability was discovered in NVIDIA Data Center GPU Manager (DCGM), a software suite for managing and monitoring NVIDIA GPUs in cluster environments. The vulnerability, tracked as CVE-2023-25515, is an integer overflow in a DCGM component that processes user-submitted metrics. A local attacker with basic user privileges could craft input that triggers the overflow.
At minimum, successful exploitation could cause a denial of service (DoS) by crashing the DCGM service, which disrupts monitoring and management for every GPU on the node. More critically, researchers demonstrated that the overflow could be controlled to achieve arbitrary code execution with the privileges of the DCGM service, which often runs as a privileged system user.
In multi-tenant AI training or cloud GPU environments, an attacker who has compromised one user's container or virtual machine could use this flaw to break out and escalate privileges on the host node, potentially gaining control of the entire machine and accessing data from other users' workloads. The issue illustrates the risk carried by the foundational infrastructure software that underpins large-scale AI systems.
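To make the bug class concrete, the following sketch emulates how an unchecked 32-bit size computation can wrap around. It is a generic illustration of an integer overflow, not DCGM's actual code: the function names, the 8-byte record size, and the 32-bit width are all assumptions chosen for the example.

```python
UINT32_MASK = 0xFFFFFFFF  # emulate a 32-bit unsigned integer (assumed width)


def buffer_size_u32(count: int, elem_size: int) -> int:
    """Compute count * elem_size the way unchecked 32-bit arithmetic would."""
    return (count * elem_size) & UINT32_MASK


def checked_buffer_size(count: int, elem_size: int) -> int:
    """Same computation, but reject inputs whose product does not fit in 32 bits."""
    size = count * elem_size
    if count < 0 or elem_size < 0 or size > UINT32_MASK:
        raise ValueError("size computation would overflow 32 bits")
    return size


# Hypothetical attacker-chosen element count with 8-byte records.
count, elem_size = 0x20000001, 8

# The true size is 0x100000008 bytes, but the wrapped result is only 8,
# so a buffer sized from it would be far too small for the data copied in.
print(buffer_size_u32(count, elem_size))  # 8
```

A buffer allocated from the wrapped value is much smaller than the data later written into it, which is how an overflow of this kind can turn into memory corruption. The checked variant shows the usual defense: validate the product before using it as a size.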
Affected Systems
Testing Guide
1. **Check DCGM Version**: On a GPU-enabled node, run the command `dcgmi --version` or check the package manager for the installed version of the `datacenter-gpu-manager` package.
2. **Compare with Affected Versions**: Compare your installed version against the list of affected versions provided in the NVIDIA Security Bulletin for CVE-2023-25515.
3. **If the version is listed as vulnerable**, the system is affected and should be patched immediately.
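The comparison in step 2 can be scripted. The sketch below is a minimal example, assuming that branches are identified by their major.minor number and that the patched releases are the ones named in this advisory (3.1.8, 2.4.14, 2.3.10). The `PATCHED` table and the helper names are illustrative; the NVIDIA Security Bulletin remains the authoritative list.

```python
import re

# Patched releases as named in this advisory, keyed by (major, minor) branch.
# Assumption: verify these against the NVIDIA Security Bulletin before relying on them.
PATCHED = {
    (3, 1): (3, 1, 8),
    (2, 4): (2, 4, 14),
    (2, 3): (2, 3, 10),
}


def parse_version(text: str) -> tuple:
    """Extract the first X.Y.Z version number found in a string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no X.Y.Z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def check_dcgm_version(text: str) -> str:
    """Return 'patched', 'vulnerable', or 'unknown' for a version string."""
    version = parse_version(text)
    fixed = PATCHED.get(version[:2])
    if fixed is None:
        # Branch not covered by the table: consult the bulletin manually.
        return "unknown"
    return "patched" if version >= fixed else "vulnerable"


print(check_dcgm_version("3.1.7"))   # vulnerable
print(check_dcgm_version("2.4.14"))  # patched
```

The function takes a version string however it was obtained, for example pasted from the package manager. It deliberately returns `unknown` for branches outside the table rather than guessing.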
Mitigation Steps
1. **Update DCGM**: Upgrade to a patched version of NVIDIA DCGM as specified in the official security bulletin (e.g., 3.1.8, 2.4.14, or later).
2. **Restrict Access**: Limit access to GPU nodes to trusted users only. Enforce strong access control policies for multi-tenant environments.
3. **Use Sandboxing**: Run AI workloads in strongly isolated containers or virtual machines with limited privileges to contain potential exploits.
4. **Monitor System Logs**: Monitor DCGM and system logs for repeated crashes or anomalous behavior that could indicate exploitation attempts.
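For the log-monitoring step, one simple heuristic is to flag bursts of service crashes within a short window. The sketch below assumes the crash or restart timestamps have already been extracted from whatever log source is in use. The function name, the threshold of three events, and the ten-minute window are illustrative choices, not values from NVIDIA guidance.

```python
from datetime import datetime, timedelta


def crash_burst(timestamps, threshold=3, window=timedelta(minutes=10)):
    """Return True if at least `threshold` events fall within any span of `window`."""
    events = sorted(timestamps)
    start = 0
    for end, ts in enumerate(events):
        # Advance the left edge until the window covers at most `window` of time.
        while ts - events[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


# Hypothetical crash times for the DCGM service on one node.
base = datetime(2023, 1, 1, 12, 0)
crashes = [base, base + timedelta(minutes=2), base + timedelta(minutes=5)]
print(crash_burst(crashes))  # True
```

Repeated crashes alone do not prove exploitation, so a positive result is best treated as a prompt for closer investigation of the node.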
Patch Details
NVIDIA released patches in DCGM versions 3.1.8, 2.4.14, and 2.3.10. Users should upgrade to the latest version available for their driver branch.