RTX Pro 6000 support
vLLM (CUDA 13.1 build) is crashing with:
bmm_fp8_internal_cublaslt failed: the library was not initialized
Hi @justinjja, be sure to have v0.17.1 installed. I was able to run the NVFP4 version of Nemotron-3-Super using both vLLM and SGLang.
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --async-scheduling \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3
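Once the server above is running, it exposes the OpenAI-compatible API. A minimal smoke test, sketched under the assumption that the host, port, and model name match the serve command (the curl call is left commented out since it needs the live server):

```shell
# Request payload for the model started by the serve command above.
PAYLOAD='{"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

# Uncomment once the server is up (host/port from the serve command):
# curl -s http://localhost:5000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"

# Offline sanity check: confirm the payload is well-formed JSON.
echo "$PAYLOAD" | python3 -c 'import json,sys; json.load(sys.stdin); print("valid JSON")'
```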
@shakhizat If you've got it working reliably and you have the time to spare I'm sure everyone here would appreciate if you could make a brief end-to-end guide to getting it set up including the commands needed to build a compatible container.
With CUDA 13.0 support, using vLLM on the Blackwell RTX 6000 Pro:
uv venv .vllm --python 3.12
source .vllm/bin/activate
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=130
export CPU_ARCH=$(uname -m)
uv pip install \
  "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION} \
  --index-strategy unsafe-best-match
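For anyone debugging a 404 on the wheel URL, this is how the filename is assembled from the three variables set above. The version here is a fixed example value (the real command queries the GitHub API for the latest tag), so the pattern is visible without network access:

```shell
# Example values; the install steps above set these dynamically.
VLLM_VERSION=0.11.0          # placeholder, normally from the GitHub releases API
CUDA_VERSION=130             # cu130 build, matching CUDA 13.0
CPU_ARCH=$(uname -m)         # e.g. x86_64 or aarch64

# Print the wheel filename the uv pip install command downloads.
echo "vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl"
```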
Here is the command to run it on the NVIDIA Jetson Thor: https://forums.developer.nvidia.com/t/running-nvidia-nemotron-3-super-120b-a12b-nvfp4-on-the-nvidia-jetson-thor/363485