vLLM command?
Is this supported in vLLM?
It should work fine; GLM-5 works as expected on vLLM.
Here's the equivalent for MiniMax-M2.5-NVFP4; just swap out the model name and the tool calling parser / reasoning parser:
export HF_HOME=/path/to/huggingface
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python -m vllm.entrypoints.openai.api_server \
  --model lukealonso/MiniMax-M2.5-NVFP4 \
  --download-dir $HUGGINGFACE_HUB_CACHE \
  --host 0.0.0.0 \
  --port 1235 \
  --served-model-name MiniMax-M2.5-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --attention-backend FLASH_ATTN \
  --gpu-memory-utilization 0.95 \
  --max-model-len 190000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
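
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (port and served model name taken from the command above; assumes you're calling it from the same host, otherwise replace localhost accordingly):

# Minimal request to confirm the server is answering; uses the --port and
# --served-model-name values from the launch command above.
curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-NVFP4",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32
  }'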