Can't get it to work on 8x RTX 3090
I can't get vLLM to start up with M2.5; I always get a CUDA out-of-memory error:
"RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 108.00 MiB. GPU 7 has a total capacity of 23.56 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 23.18 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 506.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause"
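For reference, the `PYTORCH_CUDA_ALLOC_CONF` hint at the end of that message is passed into a Docker container as an env flag; note it only mitigates fragmentation, it does not add capacity:

```shell
# Illustrative: extra flag for the `docker run` command below, forwarding the
# allocator hint from the OOM message into the container.
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```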
I start it up in Docker with the following command:
docker run -d \
  --name vllm-minimax-m2_5 \
  --restart unless-stopped \
  -p 8788:8000 \
  -v /mnt/extra/models:/root/.cache/huggingface \
  --gpus '"device=0,1,2,3,4,5,6,7"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  --ipc=host \
  vllm/vllm-openai:latest-cu130 \
  mratsim/MiniMax-M2.5-BF16-INT4-AWQ \
  --tensor-parallel-size 8 \
  --max-num-seqs 2 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.96 \
  --override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
  --kv-cache-dtype fp8 \
  --reasoning-parser minimax_m2 \
  --tool-call-parser minimax_m2 \
  --served-model-name minimax-m2.5 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce \
  --trust-remote-code
The same worked well with M2.1, btw. For M2.1 I used cyankiwi/MiniMax-M2.1-AWQ-4bit.
Any idea what's going on this time?
It should work on 8x 3090, see https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ/discussions/1
However, can you try with less context first, to rule out issues other than VRAM?
My model uses mixed precision: self-attention is left unquantized, so it's ~3 GB bigger than cyankiwi's weights. Given your gpu-memory-utilization of 0.96, that might push you over the edge.
Try --gpu-memory-utilization 0.9 and --max-model-len auto.
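Back-of-the-envelope for that suggestion (the 23.56 GiB figure comes from the OOM message; the rest is a sketch, not measured):

```python
# Per-GPU headroom left outside vLLM's budget at a given --gpu-memory-utilization.
TOTAL_GIB = 23.56  # RTX 3090 capacity as reported in the OOM message

def headroom(util: float) -> float:
    """GiB per GPU left over for the CUDA context, NCCL buffers, etc."""
    return TOTAL_GIB * (1 - util)

print(round(headroom(0.96), 2))  # ~0.94 GiB -- tight once attention stays in BF16
print(round(headroom(0.90), 2))  # ~2.36 GiB
```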
Thank you. I got it to start up with:
docker run -d \
  --name vllm-minimax-m2_5 \
  --restart unless-stopped \
  -p 8788:8000 \
  -v /mnt/extra/models:/root/.cache/huggingface \
  --gpus '"device=0,1,2,3,4,5,6,7"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  --ipc=host \
  vllm/vllm-openai:latest-cu130 \
  mratsim/MiniMax-M2.5-BF16-INT4-AWQ \
  --tensor-parallel-size 8 \
  --max-num-seqs 2 \
  --max-model-len auto \
  --gpu-memory-utilization 0.9 \
  --override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
  --kv-cache-dtype fp8 \
  --reasoning-parser minimax_m2 \
  --tool-call-parser minimax_m2 \
  --served-model-name minimax-m2.5 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce \
  --trust-remote-code
Even with
--max-model-len 196608 \
--gpu-memory-utilization 0.9 \
lowering GPU memory utilization did the trick. I thought bigger is better :D
Can you tell me your tg speed, please? I want to know how much I am losing using 4 PCs with 2x 3090 each vs. all 8 in the same machine. I am getting 64 t/s with one request, 110 t/s with 2 parallel requests.
Following accuracy degradation concerns after using the new batch_size=32 feature in LLM Compressor, I have reuploaded the quants with batch_size=1 to ensure my calibration dataset is passed as-is and not truncated to the shortest sequence in the batch. Please redownload for highest quality! (See thread: https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4)
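To illustrate the truncation issue being fixed (toy code, not the LLM Compressor internals): batching variable-length calibration samples and cutting each batch to its shortest member silently drops calibration tokens, while batch_size=1 keeps every sample intact.

```python
# Toy model of the calibration-batching pitfall: each batch is truncated to
# its shortest sequence, so long samples lose most of their tokens.
samples = [[0] * 512, [0] * 2048, [0] * 4096]  # token sequences (only length matters)

def truncate_to_shortest(batch):
    n = min(len(s) for s in batch)
    return [s[:n] for s in batch]

batched = truncate_to_shortest(samples)  # one batch of 3
print(sum(len(s) for s in batched))      # 1536 tokens survive
print(sum(len(s) for s in samples))      # 6656 tokens with batch_size=1
```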
> Thank you. I got it to start up with
> --kv-cache-dtype fp8 \
Although we have a pretty similar setup (8x RTX 3090), for me "--kv-cache-dtype fp8" crashes vLLM with:
"(EngineCore_DP0 pid=224079) ERROR 02-20 17:47:45 [core.py:946] RuntimeError: Worker failed with error 'float8 types are not supported by dlpack', please check the stack trace above for the root cause"
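Background on that crash: native FP8 support arrives with Ada/Hopper (compute capability 8.9+), while the 3090 is Ampere sm_86, so the fp8 KV-cache path there relies on a software fallback that not every vLLM/PyTorch build combination handles. The helper below is illustrative only; on a live box you would feed it `torch.cuda.get_device_capability()`:

```python
# Illustrative check: native FP8 tensor-core support by CUDA compute capability.
# Ada (sm_89) and Hopper (sm_90) have it; Ampere (sm_86, e.g. RTX 3090) does not.
def has_native_fp8(capability: tuple) -> bool:
    return tuple(capability) >= (8, 9)

print(has_native_fp8((8, 6)))  # RTX 3090 -> False
print(has_native_fp8((8, 9)))  # RTX 4090 -> True
```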
> Can you tell me your tg speed, please? I want to know how much I am losing using 4 PCs with 2x 3090 each vs. all 8 in the same machine. I am getting 64 t/s with one request, 110 t/s with 2 parallel requests.
At the beginning of the context window (max-model-len set to 196608):
INFO 02-22 18:56:04 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.5%, Prefix cache hit rate: 0.1%
Thank you. It looks like circa 20% is lost to network overhead. Not bad :)
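Using the two numbers above, the single-request penalty of splitting across machines works out to:

```python
single_node = 77.6  # t/s, 8x 3090 in one machine (log above)
networked = 64.0    # t/s, 4 PCs with 2x 3090 each
loss = 1 - networked / single_node
print(f"{loss:.1%}")  # 17.5% -- close to the ~20% eyeball estimate
```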
> Can you tell me your tg speed, please? I want to know how much I am losing using 4 PCs with 2x 3090 each vs. all 8 in the same machine. I am getting 64 t/s with one request, 110 t/s with 2 parallel requests.
Or, with P2P enabled (modded NVIDIA drivers + the vLLM trick) and "--disable-custom-all-reduce" removed! ;)
INFO 02-22 19:40:17 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 94.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.7%
FYI, I got 95 t/s with a single request via vLLM + FlashInfer, no P2P, running on 4x 4090 48G.
FlashInfer works better than FlashAttention for me.
Without P2P enabled? Can you share your command, please?
If I get 90-95 on 8x 3090, his 95 on 4x 4090 (less PCIe overhead + faster GPUs) makes sense... even without P2P! My 2c!
# Placeholders: referenced below but not shown in the original post; set them for your environment
HOST_MODEL_DIR="${HOST_MODEL_DIR:?set to the model directory on the host}"
CONTAINER_MODEL_DIR="${CONTAINER_MODEL_DIR:?set to the mount point inside the container}"
MODELNAME="${MODELNAME:?set to the served model name}"
GPU_UTIL="${GPU_UTIL:-0.95}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-196608}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-32}"
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
export PYTORCH_ALLOC_CONF="expandable_segments:True,max_split_size_mb:512"
export VLLM_SLEEP_WHEN_IDLE="1"
export VLLM_ALLREDUCE_USE_SYMM_MEM="0"
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1"
export OMP_NUM_THREADS="10"
export VLLM_FLOAT32_MATMUL_PRECISION="high"
export USE_FASTSAFETENSOR="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export SAFETENSORS_FAST_GPU="1"
export VLLM_ALLOW_LONG_MAX_MODEL_LEN="1"
IMAGE="vllm/vllm-openai:nightly"
docker run --rm --runtime nvidia --gpus all \
-v "${HOST_MODEL_DIR}:${CONTAINER_MODEL_DIR}:ro" \
-v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
-v "${HOME}/.cache/vllm:/root/.cache/vllm" \
-e LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
-e USE_FASTSAFETENSOR \
-e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE \
-e VLLM_FLOAT32_MATMUL_PRECISION \
-e PYTORCH_ALLOC_CONF \
-e OMP_NUM_THREADS \
-e VLLM_SLEEP_WHEN_IDLE \
-e VLLM_ALLREDUCE_USE_SYMM_MEM \
-e VLLM_WORKER_MULTIPROC_METHOD \
-e SAFETENSORS_FAST_GPU \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN \
-p 8000:8000 \
--ipc=host \
--entrypoint python3 \
"${IMAGE}" \
-m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model "${CONTAINER_MODEL_DIR}" \
--served-model-name "${MODELNAME}" \
--trust-remote-code \
--block-size 16 \
--disable-custom-all-reduce \
--gpu-memory-utilization "${GPU_UTIL}" \
--max-model-len "${MAX_MODEL_LEN}" \
--tensor-parallel-size 4 \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 \
--max-num-seqs "${MAX_NUM_SEQS}" \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--disable-uvicorn-access-log \
--max-cudagraph-capture-size 64 \
--attention-backend flashinfer
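Once that comes up, a quick sanity check against the OpenAI-compatible endpoint (the model name must match whatever ${MODELNAME} was set to; "minimax-m2.5" here is an assumption):

```shell
# Sanity check: one chat completion against the freshly started server on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "minimax-m2.5",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 16
      }'
```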