I can't get the model to run on vLLM 0.19.1rc1.
Here's my vllm serve command:
vllm serve models/gemma4-26b \
--served-model-name gemma \
--port 9180 \
--max-model-len 32768 \
--max-num-seqs 8 \
--gpu-memory-utilization 0.55 \
--mm-processor-cache-gb 0 \
--limit-mm-per-prompt '{"video":0}' \
--reasoning-parser gemma4 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--load-format fastsafetensors \
--enable-prefix-caching
I keep getting this error: KeyError: 'layers.0.experts.0.down_proj.qweight'
I also tried the mixed-quant variant, and that model fails with a different error: ValueError: Fused module 'language_model.model.layers.5.self_attn.qkv_proj' requires consistent quant config for ['language_model.model.layers.5.self_attn.q_proj', 'language_model.model.layers.5.self_attn.k_proj', 'language_model.model.layers.5.self_attn.v_proj']
Hey! I ran into the exact same issue trying to serve it with vLLM. It looks like vLLM just doesn't fully support this AutoRound-quantized Gemma-4 26B model yet.
It works fine if you load it with the standard transformers library instead. I put together a quick test notebook showing how to get it running:
https://github.com/vishvaRam/AutoRound-Quantaization/blob/main/Gemma-4-26B-AutoRound-Test.ipynb
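The gist of the workaround is just a plain transformers load-and-generate loop. A minimal sketch, assuming the checkpoint at models/gemma4-26b (the path from your serve command) is a standard causal-LM checkpoint that transformers can load directly; adjust device_map and dtype for your hardware:

```python
# Sketch: load the checkpoint with transformers instead of vLLM.
# The local path below is taken from the vllm serve command above (an assumption
# about your directory layout); device_map="auto" needs accelerate installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/gemma4-26b"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # spread layers across available GPUs
    torch_dtype="auto",  # keep the dtype stored in the checkpoint
)

# Simple smoke test: one short greedy-ish generation.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This is much slower than a vLLM server, of course, but it's enough to confirm the quantized weights themselves are fine.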
Hope this helps for now while we wait for vLLM to add support!