QuantTrio/Qwen3.5-397B-A17B-AWQ response is "!"
After the model is launched, all outputs returned by accessing it are exclamation marks (!). What could be the cause of this issue? The a10-awq model works normally in this environment.
Device: A800
export CONTEXT_LENGTH=32768
export CUDA_VISIBLE_DEVICES="0,1,2,3"
vllm serve \
  /models/Qwen3.5/Qwen3.5-397B-A17B-AWQ \
  --served-model-name Qwen3.5-397B-A17B-AWQ \
  --enable-expert-parallel \
  --swap-space 16 \
  --max-num-seqs 32 \
  --max-model-len $CONTEXT_LENGTH \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --mm-processor-cache-type shm \
  --mm-encoder-tp-mode data \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8086
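To rule out the client or chat template as the source of the "!" output, a minimal request against the OpenAI-compatible endpoint can help. This is a sketch: the host/port match the serve command above, and the `build_request` helper name is mine.

```python
# Minimal chat-completions request against the vLLM OpenAI-compatible server.
# Uses only the standard library; adjust host/port/model to your deployment.
import json
import urllib.request

def build_request(host="localhost", port=8086, model="Qwen3.5-397B-A17B-AWQ"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires the server to be running):
# resp = urllib.request.urlopen(build_request())
# print(json.load(resp)["choices"][0]["message"]["content"])
```

If even this bare request returns only exclamation marks, the problem is in the server/model side (dtype, kernels, quantized weights), not in any client wrapper.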
https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ/discussions/3
Double-check whether you're on CUDA 12.8/13.0.
I'm using CUDA 13.0 and am getting "!!!!!". I cleared the cache and still get nothing but "!!!!!!!!".
Did you use the same Docker image as the others?
I was using vllm/vllm-openai:cu130-nightly
My torch CUDA version is 12.8:

>>> torch.version.cuda
'12.8'
Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.
This config.json is OK!
The model works as-is, without updating anything, on my 8xA6000 server. But on my Turing server (8x Quadro RTX 8000), which doesn't support BF16, it still gives me "!!!!!!" even with the new config.json.
My command:
VLLM_USE_ATOMIC_ADD=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen3.5-397B-A17B-AWQ/ \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --attention-backend FLASHINFER \
  -tp 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"video":0}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
  --mm-processor-cache-type shm
NVCC reports CUDA 13.2.
Torch reports 2.10.0+cu128.
Honestly out of ideas.
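Since Turing (compute capability 7.5) predates BF16 support, one thing worth confirming on that box is what the GPU actually supports before the weights are loaded. A quick check, as a sketch assuming torch is installed (the `cuda_env_report` helper name is mine; fields are omitted when no GPU is visible):

```python
# Sanity check: torch build, CUDA toolkit version, device capability, BF16 support.
# Turing (capability 7.5) does not support BF16; if the checkpoint's torch_dtype
# is bfloat16, forcing FP16 (e.g. vLLM's --dtype float16) may be worth a try.
def cuda_env_report():
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch not installed in this environment
    report = {"torch": torch.__version__, "cuda": torch.version.cuda}
    if torch.cuda.is_available():
        report["capability"] = torch.cuda.get_device_capability(0)
        report["bf16_supported"] = torch.cuda.is_bf16_supported()
    return report

print(cuda_env_report())
```

On the 8xA6000 server (Ampere) this should report `bf16_supported: True`; on the Quadro RTX 8000 it should report `(7, 5)` and `False`, which would point at a dtype mismatch rather than the config.json.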