Great quant!!

#6 opened by tasticleeze

Hey, Red Hat team: this has become my primary agentic coder, chatbot, and web searcher. Thank you for your hard work!

If anyone has advice on shoring up the model’s tool call reliability, would love to hear!

For anyone interested, here's how I've tried to squeeze the most out of my 32 GB of VRAM:

```bash
vllm serve \
  --host 0.0.0.0 \
  --port 8000 \
  --pipeline-parallel-size 1 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --model /model/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name Qwen3.6-35B \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.912 \
  --max-model-len 262144 \
  --max-num-seqs 5 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-expert-parallel \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --default-chat-template-kwargs '{"preserve_thinking":true}' \
  --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}' \
  --trust-remote-code
```
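As a quick smoke test for tool-call parsing (the weather tool below is just a made-up example), I hit the OpenAI-compatible endpoint like this:

```bash
# Send a tool-call request and inspect the parsed tool_calls in the
# response. Tool name and schema are illustrative, not from the model card.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | jq '.choices[0].message.tool_calls'
```

If the parser is behaving, this should come back as well-formed JSON tool calls rather than raw text.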

Which vLLM version are you using? I can't make it work with the latest vLLM Docker image; I had to remove --speculative-config '{"method":"mtp","num_speculative_tokens":1}' to fit it.

I am using the vllm/vllm-openai:cu130-nightly image, vLLM version 0.19.2rc1.dev107+g4eafc7292.

The spec config takes about 0.5 GB, I believe; perhaps lowering --max-num-batched-tokens or --max-num-seqs would help? If you're doing text only, I believe you also save some context space with --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}'.

For full context, I am running two NVIDIA RTX 5060 Ti 16 GB cards.
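If you still can't fit the spec config, a trimmed launch along these lines might work (untested sketch; the volume path, port, and numbers are assumptions, adjust to your setup):

```bash
# Untested sketch: launch the nightly image and trade batching headroom
# for the ~0.5 GB the speculative config needs.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /model:/model \
  vllm/vllm-openai:cu130-nightly \
  --model /model/Qwen3.6-35B-A3B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096 \
  --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```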

So MTP is currently broken with that version?

MTP does currently work for this quant on vllm/vllm-openai:cu130-nightly.

I can confirm the latest vllm/vllm-openai:cu130-nightly works with the above flags!! I couldn't fit the MTP flag on my RTX 5090 with vllm:latest, but with the nightly I can run the full 262k context with MTP enabled. Awesome.

Qwen 3.6 27B is out hahahah, here we go again

Is it compatible with SGLang?

Red Hat AI org

Our models are in the compressed-tensors format, which has much better support in vLLM and is also our primary target for inference serving.
If you're seeing any particular issues with vLLM and NVFP4, feel free to share in the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack
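For example, you can see how a checkpoint declares its quantization by inspecting its config.json (a quick sketch; the exact field layout can vary between releases):

```bash
# Compressed-tensors checkpoints record their quantization scheme
# under quantization_config in config.json.
jq '.quantization_config.quant_method' /model/Qwen3.6-35B-A3B-NVFP4/config.json
# expected output: "compressed-tensors"
```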

Yeah, I never really leave comments on models, but kudos to you and your team: this quant rips on my 5090. Great code generation and agentic recursion.

@dsikka this is an insanely good quant; thank you so much to you and your team for releasing it. It's my daily driver now.

Would you be so kind as to share: do you folks have any plans to release Qwen 3.6 27B in an NVFP4 variant? Because I trust only you in this matter :)

Yes, I am also waiting for the 27B from @redhatai. Meanwhile I am using this: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP at 120 tk/s+ with 3 MTP tokens!!
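If anyone wants to try the same, the flag change is small (a sketch; assumes your build supports MTP for this model and you have VRAM headroom for the extra draft tokens):

```bash
# Sketch: serve the 27B quant with three MTP draft tokens instead of one.
# More draft tokens cost extra VRAM; untested outside the setup above.
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```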

Me too!
