Can't get the model to run on vllm 0.19.1rc1

#1, opened by toughcent

Here's my vllm serve command:

vllm serve models/gemma4-26b \
    --served-model-name gemma \
    --port 9180 \
    --max-model-len 32768 \
    --max-num-seqs 8 \
    --gpu-memory-utilization 0.55 \
    --mm-processor-cache-gb 0 \
    --limit-mm-per-prompt '{"video":0}' \
    --reasoning-parser gemma4 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --load-format fastsafetensors \
    --enable-prefix-caching 

I keep getting this error: KeyError: 'layers.0.experts.0.down_proj.qweight'

I tried the mixed quant too and I get this error with that model: ValueError: Fused module 'language_model.model.layers.5.self_attn.qkv_proj' requires consistent quant config for ['language_model.model.layers.5.self_attn.q_proj', 'language_model.model.layers.5.self_attn.k_proj', 'language_model.model.layers.5.self_attn.v_proj']
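One way to narrow down the `KeyError` is to list the tensor names the checkpoint actually contains and compare them against the `layers.0.experts.0.down_proj.qweight` key vLLM is looking up. This is a minimal diagnostic sketch, not part of any official tooling; it assumes the checkpoint is sharded as `.safetensors` files in the local model directory from the serve command, and requires the `safetensors` package:

```python
# Hypothetical diagnostic: print every tensor key in the checkpoint that
# mentions "down_proj", to compare against the key vLLM expects.
import glob
import os

MODEL_DIR = "models/gemma4-26b"  # local path from the serve command above

def list_matching_keys(model_dir: str, needle: str = "down_proj"):
    """Return all tensor names across the safetensors shards containing `needle`."""
    from safetensors import safe_open  # pip install safetensors
    keys = []
    for shard in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        with safe_open(shard, framework="pt") as f:
            keys.extend(k for k in f.keys() if needle in k)
    return keys

if os.path.isdir(MODEL_DIR):  # only scan if the checkpoint is actually on disk
    for key in list_matching_keys(MODEL_DIR):
        print(key)
```

If the printed names use a different suffix than `qweight` (or a different module layout for the experts), that mismatch is likely what vLLM's weight loader is tripping over.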

Hey! I actually ran into the exact same issue trying to serve it with vLLM. It looks like vLLM just doesn't fully support this AutoRound quantized Gemma-4 26B model yet.

It works fine if you load it with the standard transformers library instead. I put together a quick test notebook showing how to get it running:
https://github.com/vishvaRam/AutoRound-Quantaization/blob/main/Gemma-4-26B-AutoRound-Test.ipynb
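For reference, the transformers fallback looks roughly like this. It's a minimal sketch, not a copy of the notebook: the model path is taken from the serve command above, and the loading kwargs (`device_map`, `torch_dtype`) are reasonable defaults that may differ from what the notebook uses:

```python
# Sketch: load the checkpoint with plain transformers instead of vLLM.
import os

MODEL_PATH = "models/gemma4-26b"  # local path from the serve command above

def load_and_generate(model_path: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Load the model/tokenizer and run one generation."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",   # shard layers across available GPUs
        torch_dtype="auto",  # keep the dtype stored in the checkpoint
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

if os.path.isdir(MODEL_PATH):  # only run if the checkpoint is actually on disk
    print(load_and_generate(MODEL_PATH, "Explain prefix caching in one sentence."))
```

This sidesteps vLLM's fused-module and quantized-weight-name handling entirely, which is why it avoids both errors above, at the cost of vLLM's serving throughput.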

Hope this helps for now while we wait for vLLM to add support!
