Please update tokenizer config as well

#2
by alexcardo - opened

Looks like you also need to update tokenizer config

alexcardo changed discussion title from Unfortunately, this model is broken ( to Unfortunately, this model is broken (not quant)
alexcardo changed discussion title from Unfortunately, this model is broken (not quant) to Please update tokenizer config as well
Red Hat AI org

Sorry about that, it's updated. I tested locally and I'm getting coherent output.

I deleted my message hoping that the tokenizer config was the solution, but no! With long context you will eventually see the "lalala" infinite loop. Can you please share your exact vLLM command for running this quant, as well as the global sampling parameters (top_k, top_p, temperature)?

Thanks!

These are mine:

temperature=1.0,top_p=0.95,top_k=64

python3 -m vllm.entrypoints.openai.api_server \
  --model RedHatAI/gemma-4-31B-it-NVFP4 \
  --max-model-len 212992 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --host 0.0.0.0 \
  --port 8080 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-num-seqs 18 \
  --gpu-memory-utilization 0.96 \
  --trust-remote-code

vllm/vllm-openai:gemma4-cu130 (docker)
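For reference, here is a minimal sketch of querying that server with the sampling settings above. It assumes the endpoint and model name from the serve command; `top_k` is not part of the standard OpenAI chat schema, so it is sent as a top-level field, which vLLM accepts as an extension.

```python
import json
import urllib.request

# Sampling settings quoted in this thread; top_k is a vLLM-specific
# extension to the OpenAI chat-completions request schema.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

def build_request(prompt: str) -> dict:
    """Assemble a /v1/chat/completions payload with the sampling settings above."""
    return {
        "model": "RedHatAI/gemma-4-31B-it-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }

if __name__ == "__main__":
    # Assumes the vLLM server from the command above is listening on port 8080.
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(build_request("Hello")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```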

Red Hat AI org

Hi @alexcardo, I was just running a simple sanity check, nowhere near the max-model-len you set. The model card links to the vLLM docs showing suggested serve settings. Can you try RedHatAI/gemma-4-31B-it-FP8-block and see if you hit the same long-context issue? Perhaps this is a limitation of round-to-nearest NVFP4 quantization with a very small calibration dataset.

Red Hat AI org

Also, can you confirm this isn't happening for you with the original google/gemma-4-31B-it model?

Screenshot 2026-04-10 at 22.53.03

Screenshot 2026-04-10 at 22.53.14

This is the Google AI Studio interface

abaci

This is the text I want the model to translate...

Now they've fixed this in AI Studio. But they probably didn't update the model weights. You may get the correct translation in one iteration, but try a few more times.

Red Hat AI org

Please try with RedHatAI/gemma-4-31B-it-FP8-block and post whether that is any better. It may be that a round-to-nearest NVFP4 model is insufficient for your needs at long context length, and not an indication that anything is broken.

Thank you for your response, but it's clear from my message above that the same model behavior occurs with the original model. My screenshots are from Google AI Studio, which means the model is not quantized at all there.

Ok, let's just wait until other people report the issue.

Anyway, thanks for your NVFP4 quant, which fits on consumer video cards!

Red Hat AI org

Got it, thanks for the clarification.

Hi, thanks for the update! @alexcardo do you have any issues with the FP8 KV cache, or not? I'm thinking about using it.

I'd suggest you try this model yourself. I have multiple issues with it (image recognition, tool calling, infinite loops). I've been trying both the original weights and the quantized ones. My use case is deep research, so these points are crucial for me. My opinion is that the original model weights have these issues (look at my screenshots from Google AI Studio above). Still, my best suggestion is that you form your own experience.

I have collected multiple bugs and have been testing this model day and night since its release. But now I'd rather keep my opinion to myself. I want to stop sharing it and let other people (not only me) report the issues. I see that people on the LocalLLaMA subreddit have faced the same thing. But I'll keep silent for now.

If you want my PERSONAL opinion, the model itself is not ready for production use, either original or quantized. But this is my PERSONAL opinion.

Alright, I'll try it with the FP8 KV cache and see for myself. I'll let you know if I encounter the same infinite loop issues you mentioned (I haven't had any so far). The only problems I've run into are minor glitches, like a few words in another language appearing in the middle of a paragraph. Overall, people seem to prefer the 'feel' of this model compared to GPT-OSS or Qwen, but the fact that it's not yet production-ready is a bit concerning.
