High First Token Latency Issue with AWQ-4bit Model Deployment Using vLLM

#2
by Jeanxx - opened

Hello, I deployed your quantized cyankiwi/gemma-4-31B-it-AWQ-4bit model using vllm==0.19.0 and transformers==5.5.0. The startup command I used is:

/miniconda/vllm/bin/python -m vllm.entrypoints.openai.api_server \
  --model ./models--cyankiwi--gemma-4-31B-it-AWQ-4bit \
  --served-model-name Model \
  --max-num-seqs 3 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 15k \
  --port 8001 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --language-model-only \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 >> vllm.log 2>&1 &

However, I've noticed that the first-token latency is sometimes extremely long.
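To put a number on this, one option is to use curl's timing variables to measure time-to-first-byte on a streaming request, which is a reasonable proxy for first-token latency. This is a sketch that assumes the server from the command above is reachable on localhost:8001 with served model name `Model`:

```shell
# Measure time-to-first-byte (a proxy for first-token latency) on a
# streaming chat completion. Assumes the api_server launched above is
# listening on localhost:8001 and serves the model under the name "Model".
curl -s -N -o /dev/null \
  -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' \
  http://localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Model",
        "stream": true,
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

Running this several times in a row helps separate a cold start (first request compiles/warms kernels and fills the prefix cache) from steady-state latency.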

Hello, I tried to load this version in Anaconda with an RTX 3090 on Ubuntu 22.04 (kernel 6.8.0-40).
My NVIDIA driver version is 550.135.
I created an env with python=3.11.
Then I ran pip install vllm, but the default version was 0.13.0, so I upgraded it to 0.19.0.
After that I tried to install transformers==5.5.0.
It reported: vllm 0.19.0 requires transformers<5,>=4.56.0, but you have transformers 5.5.0 which is incompatible.
Why? Can I ignore that?
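The message comes from pip's resolver: vllm 0.19.0 pins transformers<5,>=4.56.0, so forcing 5.5.0 afterwards leaves the environment formally inconsistent. pip does not uninstall anything in that case, it only warns, and you can re-surface the conflict at any time. A sketch of reproducing and inspecting it (package versions as reported above):

```shell
# Install vllm first, then force a transformers version outside its pin.
pip install vllm==0.19.0
pip install transformers==5.5.0   # prints the incompatibility warning

# Re-validate the installed set: "pip check" lists any package whose
# declared requirements are not satisfied by what is actually installed.
pip check
```

Whether the mismatch is safe to ignore depends on whether that vLLM build actually exercises transformers APIs that changed; as the replies below suggest, deployment can still succeed in practice.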

In my environment, I originally had pip install vllm==0.18.0, then in order to deploy gemma4, I ran pip install vllm==0.19.0, followed by pip install transformers==5.5.0, and then I found that it could be deployed successfully.

Thank you for your kindness. I upgraded my NVIDIA driver to 580.59.08, created a new env in Anaconda, and started the vLLM container successfully with the following command:

docker run --gpus all \
  --runtime nvidia \
  --ipc=host \
  -v "$MODEL_PATH:/model" \
  -p 8000:8000 \
  vllm/vllm-openai:gemma4-cu130 \
  --model /model \
  --served-model-name gemma-4-31b \
  --dtype bfloat16 \
  --quantization compressed-tensors \
  --max-model-len 2048 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

I tested it with curl and it worked. The only limitation is that max-model-len is 2048.
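For reference, a minimal curl smoke test against the container above might look like this (a sketch, assuming port 8000 and the served model name `gemma-4-31b` from the docker command):

```shell
# Simple non-streaming smoke test against the OpenAI-compatible endpoint
# exposed by the container on localhost:8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "gemma-4-31b",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```

A JSON response with a `choices` array indicates the server is up; note that any prompt plus completion must fit within the 2048-token max-model-len set above.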
