Please update tokenizer config as well

#2
by alexcardo - opened

Looks like you also need to update tokenizer config

alexcardo changed discussion title from Unfortunately, this model is broken ( to Unfortunately, this model is broken (not quant)
alexcardo changed discussion title from Unfortunately, this model is broken (not quant) to Please update tokenizer config as well
Red Hat AI org

Sorry about that, it's updated. I tested locally and I'm getting coherent output.

I deleted my message hoping that the tokenizer config was the solution, but no! With long context you will eventually see the "lalala" infinite loop. Can you please share your exact vLLM command for running this quant, as well as the global sampling parameters (top_k, top_p, temperature)?

Thanks!

These are mine:

temperature=1.0,top_p=0.95,top_k=64

python3 -m vllm.entrypoints.openai.api_server \
  --model RedHatAI/gemma-4-31B-it-NVFP4 \
  --max-model-len 212992 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --host 0.0.0.0 \
  --port 8080 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-num-seqs 18 \
  --gpu-memory-utilization 0.96 \
  --trust-remote-code

vllm/vllm-openai:gemma4-cu130 (docker)
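For reference, here is a minimal sketch of querying that server with the sampling settings above. It assumes the endpoint and model name from the serve command; `top_k` is not part of the standard OpenAI chat schema, so it is sent as a top-level field, which vLLM accepts as an extension.

```python
import json
import urllib.request

# Sampling settings quoted in this thread; top_k is a vLLM-specific
# extension to the OpenAI chat-completions request schema.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

def build_request(prompt: str) -> dict:
    """Assemble a /v1/chat/completions payload with the sampling settings above."""
    return {
        "model": "RedHatAI/gemma-4-31B-it-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }

if __name__ == "__main__":
    # Assumes the vLLM server from the command above is listening on port 8080.
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(build_request("Hello")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```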

Red Hat AI org

Hi @alexcardo, I was just running a simple sanity check, nowhere near the max-model-len you set. The model card links to the vLLM docs showing suggested serve settings. Can you try RedHatAI/gemma-4-31B-it-FP8-block and see if you hit the same long-context issue? Perhaps this is a limitation of round-to-nearest NVFP4 quantization with a very small calibration dataset.

Red Hat AI org

Also, can you confirm this isn't happening for you with the original google/gemma-4-31B-it model?

Screenshot 2026-04-10 at 22.53.03

Screenshot 2026-04-10 at 22.53.14

This is the Google AI Studio interface

abaci

This is the text I want the model to translate...

Now they've fixed this in AI Studio. But they probably didn't update the model weights. You may get the correct translation in one iteration, but try a few more times.

Red Hat AI org

Please try with RedHatAI/gemma-4-31B-it-FP8-block and post whether that is any better. It may be that a round-to-nearest NVFP4 model is insufficient for your needs at long context length, and not an indication that anything is broken.

Thank you for your response, but it's clear from my message above that the same model behavior occurs with the original model. My screenshots are from Google AI Studio, which means the model is not quantized at all there.

Ok, let's just wait until other people report the issue.

Anyway, thanks for your NVFP4 quant, which fits on consumer video cards!

Red Hat AI org

Got it, thanks for the clarification.

Hi, thanks for the update! @alexcardo do you have any issues with the FP8 KV cache, or not? I'm thinking about using it.

I'd suggest you try this model yourself. I have multiple issues with it (image recognition, tool calling, infinite loops). I've been trying both the original weights and the quantized ones. My use case is deep research, so these points are crucial for me. My opinion is that the original model weights have these issues (look at my screenshots from Google AI Studio above). Still, my best suggestion is that you form your own experience.

I have collected multiple bugs and have been testing this model day and night since its release. But now I'd rather keep my opinion to myself. I want to stop sharing it and let other people (not only me) report the issues. I see that people on the LocalLLaMA subreddit have faced the same thing. But I'll keep silent for now.

If you want my PERSONAL opinion, the model itself is not ready for production use, either original or quantized. But this is my PERSONAL opinion.

Alright, I'll try it with the FP8 KV cache and see for myself. I'll let you know if I encounter the same infinite loop issues you mentioned (I haven't had any so far). The only problems I've run into are minor glitches, like a few words in another language appearing in the middle of a paragraph. Overall, people seem to prefer the 'feel' of this model compared to GPT-OSS or Qwen, but the fact that it's not yet production-ready is a bit concerning.
