#1 by lsm03624 - opened

GPU: 4× 5060 Ti, vLLM version 0.11. The following error occurred:
```
ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
(Worker_TP1_EP1 pid=153745) ERROR 11-08 11:29:46 [multiproc_executor.py:597] The CutlassW4A8LinearKernel cannot be used because CUTLASS W4A8 requires a compute capability of level 90 (Hopper).
(Worker_TP1_EP1 pid=153745) ERROR 11-08 11:29:46 [multiproc_executor.py:597] The MacheteLinearKernel cannot be used because Machete requires a compute capability of level 90 (Hopper).
(Worker_TP1_EP1 pid=153745) ERROR 11-08 11:29:46 [multiproc_executor.py:597] The AllSparkLinearKernel cannot be used because AllSpark currently does not support device capability equal to 120.
(Worker_TP1_EP1 pid=153745) ERROR 11-08 11:29:46 [multiproc_executor.py:597] The MarlinLinearKernel cannot be used because the value of weight_input_size_per_partition (2736) is not divisible by min_thread_k (128). Consider reducing the value of tensor_parallel_size or running the program with the --quantization gptq option.
```
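The Marlin failure at the end is a plain arithmetic constraint: the shard of a weight's input dimension that lands on each GPU must be divisible by `min_thread_k` (128). A minimal sketch of that check, assuming the log's per-partition size of 2736 at `tensor_parallel_size=4` implies a total input size of 2736 × 4 for this layer (the helper name is illustrative, not vLLM code):

```python
def marlin_shard_ok(input_size: int, tp: int, min_thread_k: int = 128) -> bool:
    """Check whether a weight's input dim, sharded across `tp` GPUs,
    satisfies Marlin's divisibility requirement."""
    per_partition = input_size // tp
    return per_partition % min_thread_k == 0

# Assumed total input size for the failing layer, reconstructed from the log
# (2736 per partition at tensor_parallel_size=4).
total = 2736 * 4
for tp in (4, 2, 1):
    print(tp, marlin_shard_ok(total, tp))
```

Note that under this assumption the shard stays non-divisible even at lower tensor-parallel sizes, which is why switching to a differently-packed quantization is the practical fix here.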

Yep, sorry, the compressed-tensors quantization trips this up on these cards. Try my AWQ version: https://huggingface.co/MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit

It's what I use on my 4× 5060 Ti's :)

No Reasoning:

```shell
sudo docker run --runtime nvidia --gpus all --detach \
  --name "vLLM" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 10.1.1.10:80:80 \
  --ipc=host \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "VLLM_TARGET_DEVICE=cuda" \
  -e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
  -e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
  -e "VLLM_USE_FLASHINFER_SAMPLER=1" \
  -e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  vllm/vllm-openai:latest \
  --model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
  --max-model-len 131072 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 0 \
  --max-num-seqs 9 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --host 0.0.0.0 \
  --port 80 \
  --served-model-name "GLM-4.5-Air" \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
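Once the container is up, it serves the OpenAI-compatible API on the host/port set by the flags above. A minimal sketch of a request payload, assuming those flags (the snippet only builds the JSON body; POST it to `http://10.1.1.10:80/v1/chat/completions`):

```python
import json

def build_chat_request(prompt: str, model: str = "GLM-4.5-Air",
                       max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completion payload for the vLLM server.
    The model name must match --served-model-name from the docker command."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Hello!")
print(json.dumps(payload, indent=2))
```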

Reasoning:

```shell
sudo docker run --runtime nvidia --gpus all --detach \
  --name "vLLM" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 10.1.1.10:80:80 \
  --ipc=host \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "VLLM_TARGET_DEVICE=cuda" \
  -e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
  -e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
  -e "VLLM_USE_FLASHINFER_SAMPLER=1" \
  -e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  vllm/vllm-openai:latest \
  --model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
  --max-model-len 131072 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 0 \
  --max-num-seqs 9 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --host 0.0.0.0 \
  --port 80 \
  --served-model-name "GLM-4.5-Air" \
  --enable-prefix-caching \
  --enable-chunked-prefill
```

Then follow the logs with:

```shell
docker logs -f vLLM
```
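With `--reasoning-parser glm45` the server separates the model's thinking from its final answer: in the OpenAI-compatible response, the message carries a `reasoning_content` field alongside the usual `content`. A sketch of pulling both out of a response dict (the sample response here is fabricated for illustration, not real server output):

```python
def split_reasoning(response: dict) -> tuple:
    """Extract (reasoning, answer) from a chat completion response
    produced with a reasoning parser enabled."""
    msg = response["choices"][0]["message"]
    return msg.get("reasoning_content"), msg.get("content")

# Hypothetical response shape, for illustration only
sample = {"choices": [{"message": {
    "reasoning_content": "The user greeted me...",
    "content": "Hello! How can I help?"}}]}
reasoning, answer = split_reasoning(sample)
```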
