Model seems to have issues in vLLM (characters duplication)

#15
by dehnhaide - opened

Got it working with "cyankiwi/gemma-4-31B-it-AWQ-8bit" on 8x RTX 3090 (I know, overkill for this small dense model) with the command listed below.

vllm serve cyankiwi/gemma-4-31B-it-AWQ-8bit --served-model-name "cyankiwi/gemma-4-31B-it-AWQ-8bit" \
--tensor-parallel-size 8 \
--max-model-len 192768 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 4 \
--max-num-batched-tokens 2048 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--host 0.0.0.0 --port 5005 \
--disable-uvicorn-access-log \
--limit-mm-per-prompt '{"image":4}' \
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":64}' \
--trust-remote-code \
--enable-prefix-caching \
--disable-custom-all-reduce

Average speed: 55-60 tok/s

However, regarding the quality of the model... something is really off (with Opencode). I asked it to "Create a simple flask application with a simple HTML, CSS and JS frontend. It should manage todos." and it loops indefinitely because it doubles like crazy on HTML tags (names + "<<"). Not sure if it's the quant or... the model...

[Screenshot: 2026-04-03 12-47-49]


Can you recheck with full size model?

The full size still has this issue.

Can you recheck with full size model?

I have tried with the full size too. The character duplication issue is NOT present there; however, that one has other issues, related to tool calling (at least from Opencode), that result in looping.

Is there a solution to this problem?

Hi @dehnhaide , thanks for sharing the detailed setup, this is super helpful.

On the behavior you are seeing, this doesn't appear to be a fundamental issue with the base Gemma 4 31B model itself; it's more consistent with an interaction between decoding, structured output handling, and the serving stack in vLLM.

In particular, when using --tool-call-parser gemma4 together with streaming, outputs that contain lots of < and /> tokens can sometimes lead to unstable reconstruction of partial sequences. This can manifest as duplicated tags or feedback loops during generation. There is an upstream fix addressing this behavior that has been approved and is pending merge, so this should improve once it lands.

Your sampling configuration may also be contributing. With temperature=1.0, top_p=0.95, and top_k=64, the decoding is relatively high entropy, which can amplify repetition or malformed structure in code heavy outputs. For this kind of task, lowering temperature to around 0.2-0.4 and slightly tightening top_p typically leads to much more stable results.
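
For example, the override flag from the serve command above could be tightened like this (the values are only a suggested starting point for code-heavy tasks, not tuned recommendations):

```
--override-generation-config '{"temperature":0.3,"top_p":0.9,"top_k":64}'
```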

It's also worth noting that AWQ quantization generally preserves overall capability, but structured generation can be more sensitive to quantization. So it's still useful to validate whether the issue reproduces on a non-quantized baseline.

To isolate the cause, a few quick checks usually help:

  • Disable tool calling: Try removing the parser and auto-tool flags to see if the issue disappears.
  • Adjust sampling: Test with a lower temperature.
  • Disable streaming: Run in non-streaming mode to rule out partial token reconstruction effects.
  • Test baseline weights: If you have the headroom, testing the bf16 version of the model can help confirm whether quantization is a factor.
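
As a quick way to combine the first three checks, here is a minimal sketch of a single non-streaming, low-temperature request with tool calling left out entirely. The endpoint port and model name come from the serve command earlier in this thread; the sampling values are only suggested starting points, not tuned settings.

```python
import json

# Minimal chat-completions payload for isolating the duplication issue:
# non-streaming, lower temperature, and no tool definitions at all.
payload = {
    "model": "cyankiwi/gemma-4-31B-it-AWQ-8bit",
    "messages": [
        {"role": "user",
         "content": "Create a simple Flask app with an HTML/CSS/JS todo frontend."}
    ],
    "stream": False,      # rules out partial-token reconstruction in streaming
    "temperature": 0.3,   # lower-entropy decoding for code-heavy output
    "top_p": 0.9,
    # intentionally no "tools" / "tool_choice": tool calling disabled
}
print(json.dumps(payload, indent=2))

# To actually send it (requires the vLLM server from above to be running):
# import requests
# r = requests.post("http://localhost:5005/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```

If the duplication disappears with this payload, re-enable streaming and tools one at a time to see which one reintroduces it.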

Finally, regarding your hardware, you are right that 8x RTX 3090 is more than sufficient. For a dense 31B model, using tensor parallel size 4 instead of 8 often improves throughput on PCIe systems due to reduced communication overhead, while still leaving plenty of room for KV cache and batching.

Dear @srikanta-221 , first of all, thanks a lot for taking the time to try to guide me through the little (or maybe bigger) mess Gemma4 is right now, at least for the vllm serving env. The idea of tweaking "temperature" and "top_p", while good, is counter-intuitive to Google's own model card, which states:

"Use the following standardized sampling configuration across all use cases:
temperature=1.0
top_p=0.95
top_k=64"

I tried it (and btw, for all testing I am only using the BF16 release, so AWQ quantization doing weird stuff to the model is out of the question!) and adjusting "temperature" and "top_p" did not help.
Also, disabling tool calling is a big no-no for me, since that is precisely the use case I am targeting Gemma4 for.

The Real Problem: Gemma4 on vLLM is very buggy for the time being

  1. Heterogeneous attention head dims → forced Triton fallback
    Gemma4's heterogeneous attention head dimensions (head_dim=256, global_head_dim=512) force vLLM to disable FlashAttention and fall back to a much slower Triton attention kernel, with custom_ops set to ['none'], meaning no vLLM-native CUDA kernels are used at all. --> This affects correctness and throughput.
  2. Reasoning parser strips special tokens before parsing
    The Gemma4ReasoningParser fails to populate reasoning_content in OpenAI chat completions responses because vLLM's text decoding strips the special <|channel|> tokens (skip_special_tokens=True) before the reasoning parser sees the text. The parser defines start/end tokens as text properties, but unlike Qwen3ReasoningParser, it does not implement start_token_id/end_token_id for token-level matching in the streaming path. --> This affects both streaming and non-streaming.
  3. Tool calling broken with Claude Code / agentic use (OpenCode confirmed too)
    There are reports of tool calling problems with Gemma4 served via vLLM when used with Claude Code.
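
Point 2 can be illustrated in isolation. The sketch below is NOT the actual Gemma4ReasoningParser, just a toy text-level matcher; the `<|channel|>` token string comes from the description above, while the end-token name is assumed purely for illustration. Once decoding strips the special tokens (skip_special_tokens=True), text-level matching has nothing left to anchor on, and reasoning_content comes back empty:

```python
# Toy illustration of text-level reasoning extraction (not vLLM source code).
START, END = "<|channel|>", "<|end|>"  # END token name is assumed for illustration

def extract_reasoning(decoded_text: str):
    """Text-level matching: only works if special tokens survive decoding."""
    if START in decoded_text and END in decoded_text:
        reasoning, rest = decoded_text.split(START, 1)[1].split(END, 1)
        return reasoning, rest
    # Tokens already stripped by the decoder -> no reasoning recovered.
    return None, decoded_text

with_tokens = START + "think step by step" + END + "final answer"
stripped = "think step by stepfinal answer"  # what skip_special_tokens=True leaves

print(extract_reasoning(with_tokens))  # ('think step by step', 'final answer')
print(extract_reasoning(stripped))     # (None, 'think step by stepfinal answer')
```

Token-level matching via start_token_id/end_token_id (as in Qwen3ReasoningParser) sidesteps this because it operates on token IDs before the decoder strips anything.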

So where exactly are these active issues tracked? A git reference would be nice.

Hey there, really appreciate the detailed follow-up, this is exactly the kind of signal that helps improve the ecosystem.

On the sampling point, that's a fair callout. The recommendations in the Gemma 4 model card are the standardised defaults the model was tuned with, so sticking to those, especially for evaluation, is completely reasonable. In practice, people sometimes still adjust decoding for specific workloads like code generation, but it makes sense that in your case it didn't address the issue. I apologise if it caused any confusion.

Also good to know you are running bf16; that clearly rules out any AWQ quantization artifacts, so we can focus entirely on the serving stack behavior.

Based on what you’ve outlined, your observations align with other reports in early Gemma 4 integrations. The model introduces some architectural differences that don’t fully match the assumptions in existing serving frameworks, which can sometimes affect both performance and correctness.

The heterogeneous attention setup you mentioned can indeed force fallback paths in some backends today, which impacts both throughput and kernel selection. Similarly, the reasoning parser behavior you described, where special tokens are stripped before parsing, matches a known gap in how token-level vs text-level parsing is currently handled, particularly in streaming scenarios.

On the tool calling side, what you are seeing with HTML duplication and instability in agentic workflows also aligns with broader reports around parser behavior when handling structured outputs. This is especially visible in integrations like Claude Code/OpenCode where tool usage is central, so I completely understand why disabling tool calling isn't an option for you.

This doesn't point to a fundamental issue with the model itself, but rather a set of active integration gaps that are still being worked through as support for Gemma 4 matures in serving stacks.

If you haven't already, it would be very helpful to raise a focused issue directly in the vLLM repository with a minimal repro, especially capturing the tool calling + HTML case. The level of detail you have shared here is exactly what maintainers need to prioritise and resolve these quickly.

You may also want to go through the existing issues in the vLLM repository that relate to the behaviours mentioned above, as there are ongoing discussions and updates that might be relevant to your setup.
Appreciate you digging into this so deeply, this kind of feedback is incredibly valuable.

I am also getting weird generations

I see our evaluation metrics tank by a massive margin when I go from the GCP Vertex AI version of Gemma 4 to vLLM + Gemma 4.

I'm using the recommended formatting and generation params but see things like:

"he doesn't even torightly look at the screen"

"He stays exactly where you're lean against him"

I'm not doing any quantization etc., and I'm using fairly vanilla settings.
