Can't get it to work on 8x RTX 3090
I can't get vLLM to start up with M2.5; I always get a CUDA out-of-memory error:
"RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 108.00 MiB. GPU 7 has a total capacity of 23.56 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 23.18 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 506.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause"
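For reference, the `PYTORCH_CUDA_ALLOC_CONF` hint at the end of that message is passed into a Docker container as an env flag; note it only mitigates fragmentation, it does not add capacity:

```shell
# Illustrative: extra flag for the `docker run` command below, forwarding the
# allocator hint from the OOM message into the container.
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```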
I start it up in Docker with the following command:
docker run -d \
  --name vllm-minimax-m2_5 \
  --restart unless-stopped \
  -p 8788:8000 \
  -v /mnt/extra/models:/root/.cache/huggingface \
  --gpus '"device=0,1,2,3,4,5,6,7"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  --ipc=host \
  vllm/vllm-openai:latest-cu130 \
  mratsim/MiniMax-M2.5-BF16-INT4-AWQ \
  --tensor-parallel-size 8 \
  --max-num-seqs 2 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.96 \
  --override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
  --kv-cache-dtype fp8 \
  --reasoning-parser minimax_m2 \
  --tool-call-parser minimax_m2 \
  --served-model-name minimax-m2.5 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce \
  --trust-remote-code
The same worked well with M2.1, btw. For M2.1 I used cyankiwi/MiniMax-M2.1-AWQ-4bit.
Any idea what's going on this time?
It should work on 8x 3090, see https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ/discussions/1
However, can you try with less context first, to rule out issues other than VRAM?
My model uses mixed precision: self-attention is left unquantized, so it's ~3 GB bigger than cyankiwi's weights. Given your gpu-memory-utilization of 0.96, that might push you over the edge.
Try --gpu-memory-utilization 0.9 and --max-model-len auto.
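Back-of-the-envelope for that suggestion (the 23.56 GiB figure comes from the OOM message; the rest is a sketch, not measured):

```python
# Per-GPU headroom left outside vLLM's budget at a given --gpu-memory-utilization.
TOTAL_GIB = 23.56  # RTX 3090 capacity as reported in the OOM message

def headroom(util: float) -> float:
    """GiB per GPU left over for the CUDA context, NCCL buffers, etc."""
    return TOTAL_GIB * (1 - util)

print(round(headroom(0.96), 2))  # ~0.94 GiB -- tight once attention stays in BF16
print(round(headroom(0.90), 2))  # ~2.36 GiB
```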
Thank you. I got it to start up with:
docker run -d \
  --name vllm-minimax-m2_5 \
  --restart unless-stopped \
  -p 8788:8000 \
  -v /mnt/extra/models:/root/.cache/huggingface \
  --gpus '"device=0,1,2,3,4,5,6,7"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  --ipc=host \
  vllm/vllm-openai:latest-cu130 \
  mratsim/MiniMax-M2.5-BF16-INT4-AWQ \
  --tensor-parallel-size 8 \
  --max-num-seqs 2 \
  --max-model-len auto \
  --gpu-memory-utilization 0.9 \
  --override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
  --kv-cache-dtype fp8 \
  --reasoning-parser minimax_m2 \
  --tool-call-parser minimax_m2 \
  --served-model-name minimax-m2.5 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce \
  --trust-remote-code
Even with
--max-model-len 196608 \
--gpu-memory-utilization 0.9 \
lowering GPU memory utilization did the trick. I thought bigger is better :D
Can you tell me your tg speed, please? I want to know how much I am losing using 4 PCs with 2x 3090 each vs. all 8 in the same machine. I am getting 64 t/s with one request, 110 t/s with 2 parallel requests.
Following accuracy degradation concerns after using the new batch_size=32 feature in LLM Compressor, I have reuploaded the quants with batch_size=1 to ensure my calibration dataset is passed as-is and not truncated to the shortest sequence in the batch. Please redownload for highest quality! (See thread: https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4)
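To illustrate the truncation issue being fixed (toy code, not the LLM Compressor internals): batching variable-length calibration samples and cutting each batch to its shortest member silently drops calibration tokens, while batch_size=1 keeps every sample intact.

```python
# Toy model of the calibration-batching pitfall: each batch is truncated to
# its shortest sequence, so long samples lose most of their tokens.
samples = [[0] * 512, [0] * 2048, [0] * 4096]  # token sequences (only length matters)

def truncate_to_shortest(batch):
    n = min(len(s) for s in batch)
    return [s[:n] for s in batch]

batched = truncate_to_shortest(samples)  # one batch of 3
print(sum(len(s) for s in batched))      # 1536 tokens survive
print(sum(len(s) for s in samples))      # 6656 tokens with batch_size=1
```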
> Thank you. I got it to start up with
> --kv-cache-dtype fp8 \
Although we have a pretty similar setup (8x RTX 3090), for me "--kv-cache-dtype fp8" crashes vLLM with:
"(EngineCore_DP0 pid=224079) ERROR 02-20 17:47:45 [core.py:946] RuntimeError: Worker failed with error 'float8 types are not supported by dlpack', please check the stack trace above for the root cause"
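Background on that crash: native FP8 support arrives with Ada/Hopper (compute capability 8.9+), while the 3090 is Ampere sm_86, so the fp8 KV-cache path there relies on a software fallback that not every vLLM/PyTorch build combination handles. The helper below is illustrative only; on a live box you would feed it `torch.cuda.get_device_capability()`:

```python
# Illustrative check: native FP8 tensor-core support by CUDA compute capability.
# Ada (sm_89) and Hopper (sm_90) have it; Ampere (sm_86, e.g. RTX 3090) does not.
def has_native_fp8(capability: tuple) -> bool:
    return tuple(capability) >= (8, 9)

print(has_native_fp8((8, 6)))  # RTX 3090 -> False
print(has_native_fp8((8, 9)))  # RTX 4090 -> True
```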
> Can you tell me your tg speed, please? I want to know how much I am losing using 4 PCs with 2x 3090 each vs. all 8 in the same machine. I am getting 64 t/s with one request, 110 t/s with 2 parallel requests.
At the beginning of the context window (max-model-len set to 196608):
INFO 02-22 18:56:04 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.5%, Prefix cache hit rate: 0.1%
Thank you. It looks like circa 20% is lost to network overhead. Not bad :)
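Using the two numbers above, the single-request penalty of splitting across machines works out to:

```python
single_node = 77.6  # t/s, 8x 3090 in one machine (log above)
networked = 64.0    # t/s, 4 PCs with 2x 3090 each
loss = 1 - networked / single_node
print(f"{loss:.1%}")  # 17.5% -- close to the ~20% eyeball estimate
```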
> Can you tell me your tg speed, please? I want to know how much I am losing using 4 PCs with 2x 3090 each vs. all 8 in the same machine. I am getting 64 t/s with one request, 110 t/s with 2 parallel requests.
Or, with P2P enabled (modded NVIDIA drivers + the vLLM trick) and "--disable-custom-all-reduce" removed! ;)
INFO 02-22 19:40:17 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 94.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.7%
FYI, I got 95 t/s with a single request via vLLM + FlashInfer, no P2P, running on 4x 4090 48G.
FlashInfer works better than FlashAttention for me.
Without P2P enabled? Can you share your command, please?
If I get 90-95 on 8x 3090, his 95 on 4x 4090 (less PCIe overhead + faster GPUs) makes sense... even without P2P! My 2c!
# Placeholders: referenced below but not shown in the original post; set them for your environment
HOST_MODEL_DIR="${HOST_MODEL_DIR:?set to the model directory on the host}"
CONTAINER_MODEL_DIR="${CONTAINER_MODEL_DIR:?set to the mount point inside the container}"
MODELNAME="${MODELNAME:?set to the served model name}"
GPU_UTIL="${GPU_UTIL:-0.95}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-196608}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-32}"
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
export PYTORCH_ALLOC_CONF="expandable_segments:True,max_split_size_mb:512"
export VLLM_SLEEP_WHEN_IDLE="1"
export VLLM_ALLREDUCE_USE_SYMM_MEM="0"
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1"
export OMP_NUM_THREADS="10"
export VLLM_FLOAT32_MATMUL_PRECISION="high"
export USE_FASTSAFETENSOR="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export SAFETENSORS_FAST_GPU="1"
export VLLM_ALLOW_LONG_MAX_MODEL_LEN="1"
IMAGE="vllm/vllm-openai:nightly"
docker run --rm --runtime nvidia --gpus all \
-v "${HOST_MODEL_DIR}:${CONTAINER_MODEL_DIR}:ro" \
-v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
-v "${HOME}/.cache/vllm:/root/.cache/vllm" \
-e LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
-e USE_FASTSAFETENSOR \
-e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE \
-e VLLM_FLOAT32_MATMUL_PRECISION \
-e PYTORCH_ALLOC_CONF \
-e OMP_NUM_THREADS \
-e VLLM_SLEEP_WHEN_IDLE \
-e VLLM_ALLREDUCE_USE_SYMM_MEM \
-e VLLM_WORKER_MULTIPROC_METHOD \
-e SAFETENSORS_FAST_GPU \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN \
-p 8000:8000 \
--ipc=host \
--entrypoint python3 \
"${IMAGE}" \
-m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model "${CONTAINER_MODEL_DIR}" \
--served-model-name "${MODELNAME}" \
--trust-remote-code \
--block-size 16 \
--disable-custom-all-reduce \
--gpu-memory-utilization "${GPU_UTIL}" \
--max-model-len "${MAX_MODEL_LEN}" \
--tensor-parallel-size 4 \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 \
--max-num-seqs "${MAX_NUM_SEQS}" \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--disable-uvicorn-access-log \
--max-cudagraph-capture-size 64 \
--attention-backend flashinfer
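Once that comes up, a quick sanity check against the OpenAI-compatible endpoint (the model name must match whatever ${MODELNAME} was set to; "minimax-m2.5" here is an assumption):

```shell
# Sanity check: one chat completion against the freshly started server on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "minimax-m2.5",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 16
      }'
```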