metadata
license: mit
tags:
  - security
  - vulnerability-report

Vulnerability Report: Unauthenticated RCE in TensorRT-LLM (MGMN Leader Node)

Summary

I have identified a critical remote code execution (RCE) vulnerability in the TensorRT-LLM Multi-GPU Multi-Node (MGMN) launcher. The mgmn_leader_node.py script initializes an IPC server without HMAC authentication, so messages received on the socket are deserialized without a signature check. Because the bind address is read from an environment variable, an attacker who can influence that variable can make the server listen on an external interface. Any host that can reach the port can then execute arbitrary code.

Vulnerability Details

Component: tensorrt_llm/llmapi/mgmn_leader_node.py and tensorrt_llm/llmapi/mpi_session.py

Root Cause:

  1. Insecure Default Initialization: In tensorrt_llm/llmapi/mgmn_leader_node.py, the RemoteMpiCommSessionServer is initialized without passing an hmac_key.

    # mgmn_leader_node.py
    server = RemoteMpiCommSessionServer(
        comm=sub_comm,
        n_workers=num_ranks,
        addr=get_spawn_proxy_process_ipc_addr_env(), 
        is_comm=True) # MISSING hmac_key
    

  2. Security Fallback Failure: In tensorrt_llm/llmapi/mpi_session.py, the __init__ method sets use_hmac_encryption to False if no key is provided.

    # mpi_session.py
    self.queue = ZeroMqQueue(..., use_hmac_encryption=bool(hmac_key))

     This disables the signature check on the IPC socket, allowing unauthenticated pickle.loads deserialization.

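  3. Insecure Environment Variable Handling: The bind address is derived from TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR (in utils.py) and is used without validation. Anyone who can influence the environment of the launching process before the service starts (for example through a shared job script or orchestration configuration) therefore controls the interface and port the server binds to.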
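The effect of the second root cause can be illustrated with a minimal sketch of HMAC-checked pickle framing. This is not the ZeroMqQueue implementation: the function names (generate_key, sign_message, load_message) and the tag-then-payload layout are assumptions made for illustration. The sketch only shows why deriving the check from bool(hmac_key) means that a missing key sends received bytes straight to pickle.loads.

```python
import hashlib
import hmac
import pickle
import secrets

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def generate_key() -> bytes:
    """Create a random shared key (hypothetical helper)."""
    return secrets.token_bytes(32)


def sign_message(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag computed over the payload."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def load_message(data: bytes, key):
    """Unpickle data, verifying the HMAC tag first when a key is configured."""
    use_hmac = bool(key)  # same pattern as the report: no key means no check
    if not use_hmac:
        # Unsafe path: whatever arrived on the socket reaches pickle.loads.
        return pickle.loads(data)
    tag, payload = data[:DIGEST_SIZE], data[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed; refusing to unpickle")
    return pickle.loads(payload)
```

With a key, a message whose payload was modified in transit is rejected before it reaches pickle.loads. With key=None, the same function unpickles any bytes it is given, which is the situation described for the leader node.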

Attack Scenario

  1. An attacker sets the environment variable:

    export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR="tcp://0.0.0.0:4444"

  2. The victim (or an automated orchestration system) executes mgmn_leader_node.py.

  3. The server binds to port 4444 on all interfaces with HMAC verification disabled.

  4. The attacker connects to port 4444 and sends a malicious pickle payload containing shell commands (e.g., a reverse shell).

  5. The ZeroMqQueue class deserializes the payload without verification, executing the attacker's code with the privileges of the TensorRT-LLM process.
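The scenario above depends on the address from the environment being used as-is. As a sketch of the kind of check that would stop the first step, the function below accepts only ipc:// endpoints or tcp:// endpoints on a loopback address. The name is_safe_ipc_addr and the loopback-only policy are assumptions for illustration and are not part of TensorRT-LLM. A real multi-node deployment may legitimately need a routable address, in which case authentication matters more than the bind address.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_ipc_addr(addr: str) -> bool:
    """Return True only for ipc:// endpoints or tcp:// endpoints on loopback."""
    parsed = urlparse(addr)
    if parsed.scheme == "ipc":
        return True
    if parsed.scheme != "tcp" or parsed.hostname is None:
        return False
    try:
        host = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # Hostnames and wildcards such as "*" are rejected outright.
        return False
    return host.is_loopback
```

Under this policy the address used in the scenario, tcp://0.0.0.0:4444, is rejected, while tcp://127.0.0.1:4444 is accepted.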

Impact

This vulnerability allows arbitrary code execution (ACE) with the privileges of the TensorRT-LLM process. In shared cluster environments (e.g., Slurm or Kubernetes), a low-privileged user can use it to escalate privileges or move laterally to other nodes running TensorRT-LLM.