Heap Overflow in NVIDIA Triton Inference Server ONNX Runtime Backend Leads to Remote Code Execution
Overview
A critical heap-based buffer overflow was discovered in the ONNX Runtime backend of NVIDIA's Triton Inference Server. The vulnerability, identified as CVE-2024-0089, arises from improper validation of metadata in a user-provided ONNX model file.

An unauthenticated remote attacker can craft a malicious ONNX model with specially formed tensor shapes or attributes. When the Triton server loads this model from a repository for inference, the parsing logic does not allocate enough memory for the data it writes, which overflows the heap buffer. The immediate effect can be a denial of service, because the server process crashes.

More critically, a skilled attacker could leverage the overflow for arbitrary code execution. By carefully crafting the model file, the attacker can overwrite adjacent heap metadata and gain control of the program's instruction pointer, which allows shellcode to run with the permissions of the Triton server process. Because Triton is often run in privileged environments with direct GPU access, a successful exploit could lead to container escape and compromise of the host system, enabling theft of sensitive models and data.
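Affected Systems
NVIDIA Triton Inference Server versions earlier than 24.01 (NGC container version 24.01-py3) that use the ONNX Runtime backend.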
Testing Guide
1. Check the version of your running NVIDIA Triton Inference Server. It can usually be found in the container logs or by querying the server's metadata endpoint.
2. If the version is earlier than 24.01 (corresponding to the NGC container version 24.01-py3), the instance is vulnerable.
3. To confirm directly, attempt to load a publicly available Proof-of-Concept (PoC) ONNX file designed to trigger the crash. A vulnerable server process will terminate, so run this test only against an isolated, non-production instance. If the server process terminates unexpectedly, you are affected.
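The version check in steps 1 and 2 can be sketched in Python. This is a minimal illustration, not an official tool: the helper names `parse_container_release`, `is_vulnerable`, and `fetch_server_metadata` are hypothetical, and the threshold of 24.01 is taken from this advisory. The metadata helper assumes the server exposes the standard `GET /v2` endpoint of the KServe v2 HTTP protocol. That endpoint reports the server's own version string rather than the container tag, so its value would need to be mapped to a container release using NVIDIA's release notes.

```python
import json
import re
import urllib.request

# First fixed release according to this advisory (NGC container 24.01-py3).
FIXED_RELEASE = (24, 1)


def parse_container_release(tag: str) -> tuple[int, int]:
    """Extract (year, month) from a tag such as 'nvcr.io/nvidia/tritonserver:23.12-py3'."""
    # Only inspect the text after the last ':' so the registry host is ignored.
    label = tag.rsplit(":", 1)[-1]
    match = re.match(r"(\d{2})\.(\d{2})", label)
    if match is None:
        raise ValueError(f"no YY.MM release found in {tag!r}")
    return int(match.group(1)), int(match.group(2))


def is_vulnerable(tag: str) -> bool:
    """True when the container release predates the fixed release."""
    return parse_container_release(tag) < FIXED_RELEASE


def fetch_server_metadata(base_url: str, timeout: float = 5.0) -> dict:
    """Return the JSON document served at <base_url>/v2 (server metadata)."""
    url = f"{base_url.rstrip('/')}/v2"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    for tag in ("nvcr.io/nvidia/tritonserver:23.12-py3", "24.01-py3"):
        print(tag, "vulnerable" if is_vulnerable(tag) else "patched")
```

Comparing the container tag is the simpler route when the image name is known, for example from the deployment manifest. The metadata endpoint is useful when only network access to the running server is available.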
Mitigation Steps
1. **Update Triton Server:** Immediately upgrade NVIDIA Triton Inference Server to version 24.01 or later, which contains the patch for this vulnerability.
2. **Scan Models:** Implement a model scanning and validation step in your MLOps pipeline before loading any new model into Triton. Use security scanners and linting tools to check for malformed or suspicious model files.
3. **Isolate Triton:** Run Triton Inference Server in a minimal, sandboxed environment (e.g., gVisor, Kata Containers) with strict network policies to limit the impact of a potential compromise.
4. **Limit Model Repository Access:** Restrict write access to the model repositories that Triton loads from. Ensure only trusted and authenticated entities can upload or modify models.
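One way to support the scanning and repository-access steps is a pre-load gate that only admits model files whose SHA-256 digest is on an approved list. The sketch below illustrates that idea under stated assumptions. It is an allowlist check, not a parser-level scan for malformed tensor shapes, and the function names `sha256_of` and `unapproved_models` are hypothetical. It also assumes models are stored as `.onnx` files somewhere under the repository directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large models are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unapproved_models(repository: Path, approved_digests: set[str]) -> list[Path]:
    """Return every .onnx file under `repository` whose digest is not approved."""
    return sorted(
        path
        for path in repository.rglob("*.onnx")
        if sha256_of(path) not in approved_digests
    )


if __name__ == "__main__":
    import sys

    # Usage: python gate.py <model_repository> <approved_digest> [...]
    rejected = unapproved_models(Path(sys.argv[1]), set(sys.argv[2:]))
    for path in rejected:
        print(f"unapproved model: {path}")
    sys.exit(1 if rejected else 0)
```

A gate like this could run in the pipeline before the repository is mounted into the Triton container, with the approved digests recorded when a model passes review. It complements the upgrade to a patched release and does not replace it.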
Patch Details
The vulnerability is patched in Triton Inference Server version 24.01, available in the NVIDIA NGC catalog.