[vLLM] Stuck at "Waiting for output from MQLLMEngine"
Hello :)
I'm trying to run vLLM in a Docker container using the model Mistral-Small-3.1-24B-Instruct-2503 with 2 GPUs, but the startup hangs indefinitely with this repeated message:
DEBUG [client.py:193] Waiting for output from MQLLMEngine.
Setup
- GPUs: 2 (A100 80GB)
- Docker image:
vllm/vllm-openai:v0.8.4
Docker command
docker run -d \
--name vllm_service \
--gpus '"device=0,1"' \
--shm-size=64g \
--env-file .env \
-p 4445:4445 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--network bridge \
vllm/vllm-openai:v0.8.4 \
--host 0.0.0.0 \
--model "mistralai/Mistral-Small-3.1-24B-Instruct-2503" \
--gpu-memory-utilization 0.95 \
--dtype float16 \
--port 4445 \
--tensor-parallel-size 2 \
--enforce-eager \
--max-model-len 30800 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--limit_mm_per_prompt "image=10"
Environment variables
NVIDIA_VISIBLE_DEVICES=all
NCCL_MAX_SOCKET_BUFFERS=16384
PYTHONUNBUFFERED=1
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True
NCCL_DEBUG=INFO
Logs
NCCL initializes properly:
- Ranks and GPUs are assigned
- All channels and rings are connected
- No errors in the logs
The process loops forever with:
DEBUG [client.py:193] Waiting for output from MQLLMEngine.
The server never becomes ready.
I've tried a bunch of different things (see environment variables above), but nothing seems to work...
Any idea what might be causing this or how to fix it?
Thanks in advance!
More logs:
INFO 04-22 02:26:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-22 02:26:30 [config.py:2832] Downcasting torch.float32 to torch.float16.
INFO 04-22 02:26:46 [config.py:689] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 04-22 02:26:48 [arg_utils.py:1731] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 04-22 02:26:49 [config.py:1713] Defaulting to use mp for distributed inference
WARNING 04-22 02:26:49 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-22 02:26:49 [api_server.py:246] Started engine process with PID 107
INFO 04-22 02:26:53 [__init__.py:239] Automatically detected platform cuda.
INFO 04-22 02:26:54 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='mistralai/Mistral-Small-3.1-24B-Instruct-2503', speculative_config=None, tokenizer='mistralai/Mistral-Small-3.1-24B-Instruct-2503', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=30800, download_dir=None, load_format=LoadFormat.MISTRAL, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=mistralai/Mistral-Small-3.1-24B-Instruct-2503, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING 04-22 02:26:56 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 30 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-22 02:26:57 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-22 02:26:57 [cuda.py:289] Using XFormers backend.
INFO 04-22 02:26:59 [__init__.py:239] Automatically detected platform cuda.
(VllmWorkerProcess pid=145) INFO 04-22 02:27:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=145) INFO 04-22 02:27:02 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=145) INFO 04-22 02:27:02 [cuda.py:289] Using XFormers backend.
INFO 04-22 02:27:03 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-22 02:27:03 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=145) INFO 04-22 02:27:03 [utils.py:993] Found nccl from library libnccl.so.2
fe249c782f4f:107:107 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
(VllmWorkerProcess pid=145) INFO 04-22 02:27:03 [pynccl.py:69] vLLM is using nccl==2.21.5
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: Using internal network plugin.
fe249c782f4f:107:107 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
fe249c782f4f:145:145 [1] NCCL INFO cudaDriverVersion 12040
fe249c782f4f:145:145 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: Using internal network plugin.
fe249c782f4f:107:107 [0] NCCL INFO NET/IB : No device found.
fe249c782f4f:107:107 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
fe249c782f4f:107:107 [0] NCCL INFO Using non-device net plugin version 0
fe249c782f4f:107:107 [0] NCCL INFO Using network Socket
fe249c782f4f:145:145 [1] NCCL INFO NET/IB : No device found.
fe249c782f4f:145:145 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
fe249c782f4f:145:145 [1] NCCL INFO Using non-device net plugin version 0
fe249c782f4f:145:145 [1] NCCL INFO Using network Socket
fe249c782f4f:145:145 [1] NCCL INFO ncclCommInitRank comm 0x135efc30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 80 commId 0x3f6e4e56e2d8d67a - Init START
fe249c782f4f:107:107 [0] NCCL INFO ncclCommInitRank comm 0x3f5d86f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 70 commId 0x3f6e4e56e2d8d67a - Init START
fe249c782f4f:107:107 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
fe249c782f4f:145:145 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
fe249c782f4f:145:145 [1] NCCL INFO comm 0x135efc30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
fe249c782f4f:107:107 [0] NCCL INFO comm 0x3f5d86f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
fe249c782f4f:107:107 [0] NCCL INFO Channel 00/02 : 0 1
fe249c782f4f:145:145 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
fe249c782f4f:107:107 [0] NCCL INFO Channel 01/02 : 0 1
fe249c782f4f:145:145 [1] NCCL INFO P2P Chunksize set to 131072
fe249c782f4f:107:107 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
fe249c782f4f:107:107 [0] NCCL INFO P2P Chunksize set to 131072
fe249c782f4f:145:145 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
fe249c782f4f:107:107 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
fe249c782f4f:145:145 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
fe249c782f4f:145:145 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
fe249c782f4f:107:107 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
fe249c782f4f:107:107 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
fe249c782f4f:145:145 [1] NCCL INFO Connected all rings
fe249c782f4f:107:107 [0] NCCL INFO Connected all rings
fe249c782f4f:145:145 [1] NCCL INFO Connected all trees
fe249c782f4f:107:107 [0] NCCL INFO Connected all trees
fe249c782f4f:145:145 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
fe249c782f4f:145:145 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
fe249c782f4f:107:107 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
fe249c782f4f:107:107 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
fe249c782f4f:107:107 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
fe249c782f4f:145:145 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
fe249c782f4f:107:107 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
fe249c782f4f:145:145 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
fe249c782f4f:145:145 [1] NCCL INFO ncclCommInitRank comm 0x135efc30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 80 commId 0x3f6e4e56e2d8d67a - Init COMPLETE
fe249c782f4f:107:107 [0] NCCL INFO ncclCommInitRank comm 0x3f5d86f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 70 commId 0x3f6e4e56e2d8d67a - Init COMPLETE
INFO 04-22 02:27:04 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-22 02:27:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=145) WARNING 04-22 02:27:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-22 02:27:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=145) WARNING 04-22 02:27:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-22 02:27:17 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_7c6fd185'), local_subscribe_addr='ipc:///tmp/dd3a6d06-f420-4cc7-8b9d-d6a3fd1115a3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-22 02:27:17 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-22 02:27:17 [model_runner.py:1110] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [model_runner.py:1110] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
INFO 04-22 02:27:17 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
INFO 04-22 02:27:17 [weight_utils.py:265] Using model weights format ['consolidated*.safetensors', '*.pt']
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [weight_utils.py:265] Using model weights format ['consolidated*.safetensors', '*.pt']
DEBUG 04-22 02:27:20 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:30 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:40 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:51 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:01 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:11 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:21 [client.py:193] Waiting for output from MQLLMEngine.
...
Same issue here.
It looks like vLLM is still downloading the model files. If you download the model first via the Hugging Face CLI, it won't hang there.
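Concretely, something like this on the host should populate the cache that the docker command above mounts (default cache location assumed; log in with huggingface-cli login first if the repo is gated):

# Pre-download the weights into ~/.cache/huggingface on the host
pip install -U "huggingface_hub[cli]"
huggingface-cli download mistralai/Mistral-Small-3.1-24B-Instruct-2503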
Have you tried --trust-remote-code?
Frostyseppo's tip worked for me (thanks!)
This is a known pain point with vLLM's multiprocessing engine architecture. The "Waiting for output from MQLLMEngine" hang typically means the engine subprocess either crashed silently or the ZMQ socket connection between the main process and the engine worker was never established. For a 24B-parameter model like Mistral-Small-3.1, the first thing I'd check is GPU memory: if the model OOMs during weight loading, the engine process dies without propagating a clean error back to the parent, which is left polling indefinitely.
A few concrete things to try: set VLLM_WORKER_MULTIPROC_METHOD=spawn explicitly, add --disable-log-stats to reduce noise, and keep --enforce-eager (which you already have) so CUDA graph capture is ruled out as the hang point. Also make sure you're on a recent vLLM version; there have been fixes around MQLLMEngine timeout handling in earlier releases. Since you're running tensor parallelism across two GPUs, NCCL initialization is the other usual suspect, but your NCCL_DEBUG=INFO logs already show Init COMPLETE, so the hang is most likely in weight loading itself.
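For reference, the first two suggestions applied to your setup would look like this (both are real vLLM knobs; where exactly you put them is up to you):

# In the .env file passed via --env-file, so it reaches the engine process
VLLM_WORKER_MULTIPROC_METHOD=spawn

# Appended to the server arguments at the end of the docker run command
--disable-log-stats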
One thing worth noting if you're running this model in an agentic pipeline: a hang at the engine layer is particularly nasty in multi-agent setups, because upstream orchestrators may not distinguish between "model is slow" and "model is dead," leading to cascading timeouts. We've run into this in AgentGraph when building trust-scoring infrastructure: the health-check and liveness signaling between agent components needs to be explicit rather than inferred from response latency. For production deployments of models like this one, wrapping vLLM behind a proper health endpoint and treating engine-subprocess liveness as a first-class concern saves a lot of debugging time.
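As a sketch of that last point: the vLLM OpenAI-compatible server already exposes a GET /health route, so a minimal readiness probe (port 4445 per the command above; the retry counts are arbitrary) could look like:

#!/bin/bash
# Poll vLLM's /health endpoint until the server answers, with a hard timeout.
# /health only responds once the API server has finished starting up, so a
# dead engine subprocess shows up here as "never became healthy".
for attempt in $(seq 1 60); do
  if curl -sf http://localhost:4445/health > /dev/null; then
    echo "vLLM is up after ${attempt} attempt(s)"
    exit 0
  fi
  sleep 10
done
echo "vLLM never became healthy; engine subprocess is likely dead" >&2
exit 1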
This is a known pain point with vLLM's multiprocessing engine architecture. The "Waiting for output from MQLLMEngine" hang typically happens when the engine subprocess dies silently or fails to initialize, leaving the parent process blocked on the ZMQ socket indefinitely. With Mistral-Small-3.1-24B-Instruct-2503 specifically, I've seen this triggered by a few things: tensor parallelism misconfiguration (make sure --tensor-parallel-size matches your actual GPU count), insufficient shared memory for the IPC transport (check df -h /dev/shm), and occasionally a mismatch between the model's sliding window attention config and the vLLM version you're running. The 24B model's architecture uses grouped-query attention with specific rope scaling that older vLLM builds handle poorly.
Practical debugging steps: run with --disable-frontend-multiprocessing first to isolate whether the issue is in the MQ layer itself or deeper in model loading. Also check VLLM_WORKER_MULTIPROC_METHOD: on some systems, defaulting to fork instead of spawn causes silent subprocess failures with CUDA contexts. Older vLLM builds also had --engine-use-ray to sidestep the MQLLMEngine path entirely, though it added its own overhead and has since been removed, so it won't help on 0.8.4.
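A quick sketch of those first checks (the flag exists in vLLM 0.8.x; the container name is from the docker command above):

# Verify the --shm-size actually took effect inside the container;
# the IPC transport between engine processes lives in /dev/shm
docker exec vllm_service df -h /dev/shm

# To isolate the MQ layer, append this to the server arguments and relaunch:
--disable-frontend-multiprocessing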
The hang is also hard to detect programmatically, because client-side timeout behavior depends entirely on how you've configured the engine client. If you're building agent orchestration on top of this (something we deal with constantly in trust and identity infrastructure for agent systems), you want explicit health-check endpoints and engine liveness probes rather than relying on request timeouts alone: a stuck MQLLMEngine will happily accept connections while never producing tokens, which is a silent failure mode that propagates badly through upstream agent workflows.