I can't get the model to run on vLLM 0.19.1rc1.
Here's my vllm serve command:
vllm serve models/gemma4-26b \
--served-model-name gemma \
--port 9180 \
--max-model-len 32768 \
--max-num-seqs 8 \
--gpu-memory-utilization 0.55 \
--mm-processor-cache-gb 0 \
--limit-mm-per-prompt '{"video":0}' \
--reasoning-parser gemma4 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--load-format fastsafetensors \
--enable-prefix-caching
I keep getting this error: KeyError: 'layers.0.experts.0.down_proj.qweight'
I also tried the mixed-quant variant, and that model fails with a different error: ValueError: Fused module 'language_model.model.layers.5.self_attn.qkv_proj' requires consistent quant config for ['language_model.model.layers.5.self_attn.q_proj', 'language_model.model.layers.5.self_attn.k_proj', 'language_model.model.layers.5.self_attn.v_proj']
Hey! I ran into the exact same issue trying to serve it with vLLM. It looks like vLLM just doesn't fully support this AutoRound-quantized Gemma-4 26B model yet.
It works fine if you load it with the standard transformers library instead. I put together a quick test notebook showing how to get it running:
https://github.com/vishvaRam/AutoRound-Quantaization/blob/main/Gemma-4-26B-AutoRound-Test.ipynb
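The gist of the workaround is just a plain transformers load-and-generate loop. A minimal sketch, assuming the checkpoint at models/gemma4-26b (the path from your serve command) is a standard causal-LM checkpoint that transformers can load directly; adjust device_map and dtype for your hardware:

```python
# Sketch: load the checkpoint with transformers instead of vLLM.
# The local path below is taken from the vllm serve command above (an assumption
# about your directory layout); device_map="auto" needs accelerate installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/gemma4-26b"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # spread layers across available GPUs
    torch_dtype="auto",  # keep the dtype stored in the checkpoint
)

# Simple smoke test: one short greedy-ish generation.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This is much slower than a vLLM server, of course, but it's enough to confirm the quantized weights themselves are fine.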
Hope this helps for now while we wait for vLLM to add support!