Container Breakout in ML Workloads via `runc` File Descriptor Leak Vulnerability
Overview
A critical vulnerability in `runc`, the industry-standard low-level container runtime, has a severe impact on containerized AI and ML workloads. It is tracked as CVE-2024-21626 and was disclosed as part of the 'Leaky Vessels' set. The vulnerability allows an attacker who has already achieved code execution within a container, or who can get a malicious image started, to escape to the underlying host operating system. The flaw stems from an internal file descriptor leak during container process setup, affecting both container startup (`runc run`) and `runc exec`. A malicious program inside the container (e.g., a compromised Python dependency in a training script or a malicious model) could exploit this condition to obtain a file descriptor that refers to the host filesystem. This bypasses the container's filesystem isolation and, from there, opens the way to full host compromise.

For ML workloads, the impact is catastrophic. An attacker could escape a training job container, access the host's GPU devices directly, steal proprietary models and datasets from other containers running on the same node, compromise the Kubernetes node itself, and move laterally within the cloud environment. Since most standard ML container images from NVIDIA, PyTorch, and TensorFlow are typically run on container runtimes that depend on `runc` (such as Docker), a vast number of MLOps environments were at risk as soon as the flaw was disclosed.
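One way to reason about the leak is to look at where a container process is asked to start. The public advisory for CVE-2024-21626 describes exploitation through a working directory that points at a `/proc/self/fd/<n>` entry. The sketch below flags such paths in a container configuration. The function name `is_suspicious_workdir` and the exact pattern are illustrative assumptions, not part of any `runc` or scanner API, and a match is only a heuristic signal, not proof of exploitation.

```python
import re

# Heuristic pattern (an assumption for illustration): a working directory
# that starts at a process file-descriptor entry such as /proc/self/fd/7.
_FD_PATH = re.compile(r"^/proc/(?:self|thread-self|\d+)/fd/\d+(?:/|$)")


def is_suspicious_workdir(path: str) -> bool:
    """Return True if the path begins with /proc/<pid|self>/fd/<n>."""
    # Collapse repeated slashes only; ".." segments are left untouched on
    # purpose, because lexical normalization would hide the fd prefix.
    collapsed = re.sub(r"/+", "/", path.strip())
    return bool(_FD_PATH.match(collapsed))
```

Such a check could be applied to the `WORKDIR` of an image or to the working directory passed to an exec call before a job is admitted. It does not replace patching the runtime.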
Affected Systems
Testing Guide
1. **Check `runc` Version:** On each host node, run `runc --version`. If the version is below `1.1.12`, treat the node as vulnerable unless your distribution documents a backported fix for that package.
2. **Check Docker Version:** Run `docker --version`. If the version is below `25.0.2` (or below `24.0.9` on the 24.0 release line), the system is likely vulnerable. For older release lines, check the vendor's release notes for the first build that bundles `runc` `1.1.12` or later.
3. **Use a Vulnerability Scanner:** Use a vulnerability scanner (e.g., Trivy, Grype) or a cloud security posture management (CSPM) tool to check hosts and node images for CVE-2024-21626. Because `runc` is installed on the host, scanning application container images alone is not sufficient.
4. **Consult Cloud Provider Bulletins:** Check the security bulletins for your cloud provider (AWS, GCP, Azure) for information on affected services and patching status.
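The version check in step 1 can be automated across a fleet. The following is a minimal sketch that parses the text printed by `runc --version` and compares it with the patched release. It assumes the first line has the form `runc version X.Y.Z`; the helper names `parse_runc_version` and `is_vulnerable` are hypothetical. Distribution packages that backport the fix without bumping the upstream version will be reported as vulnerable by this heuristic.

```python
import re
from typing import Optional, Tuple

# First fixed upstream release, as stated in this advisory.
PATCHED_RUNC = (1, 1, 12)


def parse_runc_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from `runc --version` output."""
    match = re.search(r"runc version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_vulnerable(output: str) -> Optional[bool]:
    """True if below the patched release, None if the output is unparseable."""
    version = parse_runc_version(output)
    if version is None:
        return None
    # Tuples compare numerically, so (1, 1, 9) < (1, 1, 12) holds.
    return version < PATCHED_RUNC
```

On a node, the input string could come from `subprocess.run(["runc", "--version"], capture_output=True, text=True).stdout`. A `None` result should be treated as "unknown" and investigated rather than assumed safe.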
Mitigation Steps
1. **Patch Container Runtimes:** Immediately update `runc` to version `1.1.12` or later and Docker Engine to `25.0.2` or later (or `24.0.9` or later on the 24.0 release line) on all host machines (VMs, bare metal servers).
2. **Update Kubernetes Nodes:** For managed Kubernetes services (EKS, GKE, AKS), follow the provider's instructions for updating node pools to patched versions. For self-managed clusters, ensure the underlying OS and container runtime are patched on all nodes.
3. **Rebuild Base Images:** Rebuild all custom ML container images from up-to-date base images, pulling the latest official images from Docker Hub, NVIDIA NGC, etc. Note that the fix for this CVE lives in the host runtime, so rebuilding images complements patching the nodes and does not replace it.
4. **Use Stricter Security Contexts:** Apply Kubernetes Security Contexts or Pod Security Standards (e.g., `Baseline` or `Restricted`) to limit the capabilities of workloads, reducing the impact of a potential breakout.
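To make step 4 concrete, the sketch below checks a container spec, represented as a plain dictionary, against a few settings associated with the `Restricted` Pod Security Standard. It is a simplified subset for illustration only: it inspects container-level `securityContext` fields and ignores pod-level defaults, and the function name `restricted_violations` is hypothetical. These settings limit what a workload can do after a compromise; they are not a fix for the `runc` flaw itself.

```python
from typing import Any, Dict, List


def restricted_violations(container: Dict[str, Any]) -> List[str]:
    """List hardening settings missing from one container spec."""
    ctx = container.get("securityContext") or {}
    problems: List[str] = []

    if ctx.get("privileged") is True:
        problems.append("privileged must not be true")
    if ctx.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    if ctx.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")

    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        problems.append("capabilities.drop must include ALL")

    seccomp = (ctx.get("seccompProfile") or {}).get("type")
    if seccomp not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")

    return problems
```

A training job spec that returns an empty list satisfies this subset. In a cluster, the built-in Pod Security Admission controller is the more complete way to enforce these standards.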
Patch Details
Patched in runc v1.1.12 and subsequently in Docker Engine 25.0.2 and 24.0.9, and in newer releases.