Text Generation
Transformers
Safetensors
PyTorch
nemotron_h
nvidia
nemotron-3
latent-moe
mtp
conversational
custom_code
8-bit precision
modelopt

Doesn't work with the latest vLLM; I even tried recompiling vLLM and Transformers from git.

#8
by catplusplus - opened

Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] EngineCore failed to start.
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] Traceback (most recent call last):
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1086, in run_engine_core
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return func(*args, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 830, in __init__
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] super().__init__(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] kv_cache_config = self._initialize_kv_caches(vllm_config)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return func(*args, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 243, in _initialize_kv_caches
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] available_gpu_memory = self.model_executor.determine_available_memory()
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return self.collective_rpc("determine_available_memory")
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] result = run_method(self.driver_worker, method, args, kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return func(*args, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return func(*args, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 397, in determine_available_memory
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return func(*args, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5654, in profile_cudagraph_memory
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] self._warmup_and_capture(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5784, in _warmup_and_capture
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] self._dummy_run(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return func(*args, **kwargs)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5148, in _dummy_run
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] attn_metadata, _ = self._build_attention_metadata(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2102, in _build_attention_metadata
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2053, in _build_attn_group_metadata
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] attn_metadata_i = builder.build(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/mamba2_attn.py", line 135, in build
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] common = self._compute_common_metadata(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/mamba_attn.py", line 477, in _compute_common_metadata
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] return self._update_metadata_for_cudagraph_capture(metadata)
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/mamba_attn.py", line 498, in _update_metadata_for_cudagraph_capture
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] self.state_indices_tensor_d[: metadata.num_decodes].copy_(
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) ERROR 03-11 20:54:48 [core.py:1096] RuntimeError: The size of tensor a (32) must match the size of tensor b (33) at non-singleton dimension 1
Mar 11 20:54:48 amano unglitched_vllm[541497]: (EngineCore_DP0 pid=541497) Process EngineCore_DP0:
Mar 11 20:54:4

The following error occurs with FlashInfer attention. Same build; Qwen 3.5 122B works fine with the same options.
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) return forward_call(*args, **kwargs)
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File ".99", line 523, in forward
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) submod_1 = self.submod_1(getitem, s72, getitem_1); getitem = submod_1 = None
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 949, in call_wrapped
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) return self._wrapped_call(self, *args, **kwargs)
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 461, in __call__
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) raise e
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 447, in __call__
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) return self._call_impl(*args, **kwargs)
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) return forward_call(*args, **kwargs)
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File ".101", line 5, in forward
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) mamba_mixer2 = torch.ops.vllm.mamba_mixer2(output, ssm_output, 'model.layers.0.mixer'); output = ssm_output = mamba_mixer2 = None
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1275, in __call__
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) return self._op(*args, **kwargs)
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 926, in mamba_mixer2
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) self.conv_ssm_forward(projected_states=projected_states, output=output)
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 874, in conv_ssm_forward
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) selective_state_update(
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) File "/home/olegk/venv/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/mamba_ssm.py", line 338, in selective_state_update
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) assert state_batch_indices is not None and state_batch_indices.dim() == 2
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 12 08:03:29 amano unglitched_vllm[709328]: (EngineCore_DP0 pid=709328) AssertionError
Mar 12 08:03:29 amano unglitched_vllm[709228]: (APIServer pid=709228) INFO: 192.168.1.253:50854 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
Mar 12 08:03:29 amano unglitched_vllm[709228]: (APIServer pid=709228) INFO: Shutting down
Mar 12 08:03:29 amano unglitched_vllm[709328]: [rank0]:[W312 08:03:29.392925091 ProcessGroupNCCL.cpp:1593] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Mar 12 08:03:29 amano unglitched_vllm[709228]: (APIServer pid=709228) INFO: Waiting for application shutdown.

NVIDIA org

What configurations are you trying and on what hardware?

Thanks!

CUDA_HOME=/usr/local/cuda-13.0 -e C_INCLUDE_PATH=/usr/local/cuda-13.0/include -e LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/nvidia -e FLASHINFER_NVCC=/usr/local/cuda-13.0/bin/nvcc ~/bin/vllm --trust-remote-code --served-model-name Nikola --port 9000 --enable-auto-tool-choice --kv-cache-dtype fp8 --tool-call-parser qwen3_coder --reasoning-parser-plugin ./super_v3_reasoning_parser.py --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' --default-chat-template-kwargs '{"enable_thinking": false}' --enable-prefix-caching --max_num_batched_tokens 8192 --async-scheduling --enable-chunked-prefill --reasoning-parser super_v3 --model "$@"

This is on an NVIDIA Thor Dev Kit. It would help to get an updated https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm container that is known to work.

Confirmed that it doesn't work with the official nvcr.io/nvidia/vllm:26.02-py3 container either.
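For what it's worth, both tracebacks above die inside the Mamba-2 decode path: the first during CUDA-graph capture in mamba_attn.py, the second in selective_state_update, which asserts a 2-D state_batch_indices. As a diagnostic (an assumption on my part, not a confirmed fix), it may be worth ruling out CUDA-graph capture and the MTP draft path by launching with vLLM's --enforce-eager flag and without --speculative-config, keeping the rest of the options minimal:

```shell
# Diagnostic launch sketch, not a confirmed fix. Assumes the same ~/bin/vllm
# wrapper and environment as the full command above.
# --enforce-eager disables CUDA-graph capture (where the first traceback fails);
# omitting --speculative-config removes the MTP speculative-decoding path.
~/bin/vllm \
  --trust-remote-code \
  --served-model-name Nikola \
  --port 9000 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --max_num_batched_tokens 8192 \
  --model "$@"
```

If the server starts and answers requests in this configuration, the failure narrows to CUDA-graph capture and/or MTP rather than the model itself.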
