Great quant!!

#6 opened by tasticleeze

Hey, Red Hat team: this has become my primary agentic coder, chatbot, and web searcher. Thank you for your hard work!

If anyone has advice on shoring up the model’s tool call reliability, would love to hear!

For anyone interested, here's how I've tried to squeeze the most out of my 32 GB of VRAM:

```bash
vllm serve \
  --host 0.0.0.0 \
  --port 8000 \
  --pipeline-parallel-size 1 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --model /model/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name Qwen3.6-35B \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.912 \
  --max-model-len 262144 \
  --max-num-seqs 5 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-expert-parallel \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --default-chat-template-kwargs '{"preserve_thinking":true}' \
  --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}' \
  --trust-remote-code
```
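As a quick smoke test for tool-call parsing (the weather tool below is just a made-up example), I hit the OpenAI-compatible endpoint like this:

```bash
# Send a tool-call request and inspect the parsed tool_calls in the
# response. Tool name and schema are illustrative, not from the model card.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | jq '.choices[0].message.tool_calls'
```

If the parser is behaving, this should come back as well-formed JSON tool calls rather than raw text.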

Which vLLM version are you using? I can't make it work with the latest vLLM Docker image; I had to remove --speculative-config '{"method":"mtp","num_speculative_tokens":1}' to fit it.

I am using the vllm/vllm-openai:cu130-nightly image, vLLM version 0.19.2rc1.dev107+g4eafc7292.

The spec config takes about 0.5 GB, I believe; perhaps lowering --max-num-batched-tokens or --max-num-seqs would help? If you're doing text only, I believe you also save some context space with --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}'.

For full context, I am running two NVIDIA RTX 5060 Ti 16 GB cards.
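If you still can't fit the spec config, a trimmed launch along these lines might work (untested sketch; the volume path, port, and numbers are assumptions, adjust to your setup):

```bash
# Untested sketch: launch the nightly image and trade batching headroom
# for the ~0.5 GB the speculative config needs.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /model:/model \
  vllm/vllm-openai:cu130-nightly \
  --model /model/Qwen3.6-35B-A3B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096 \
  --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```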

So MTP is currently broken with that version?

MTP does currently work for this quant on vllm/vllm-openai:cu130-nightly.

I can confirm the latest vllm/vllm-openai:cu130-nightly works with the above flags!! I couldn't fit the MTP flag on my RTX 5090 with vllm:latest, but with the nightly I can run the full 262k context with MTP enabled. Awesome.

Qwen 3.6 27B is out hahahah, here we go again

Is it compatible with SGLang?

Red Hat AI org

Our models are in the compressed-tensors format, which has much better support in vLLM and is also our primary target for inference serving.
If you're seeing any particular issues with vLLM and NVFP4, feel free to share in the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack
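For example, you can see how a checkpoint declares its quantization by inspecting its config.json (a quick sketch; the exact field layout can vary between releases):

```bash
# Compressed-tensors checkpoints record their quantization scheme
# under quantization_config in config.json.
jq '.quantization_config.quant_method' /model/Qwen3.6-35B-A3B-NVFP4/config.json
# expected output: "compressed-tensors"
```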

Yeah, I never really leave comments on models, but kudos to you and your team: this quant rips on my 5090. Great code generation and agentic recursion.

@dsikka this is an insanely good quant; thank you so much to you and your team for releasing it. It's my daily driver now.

Would you be so kind as to share: do you folks have any plans to release Qwen 3.6 27B in an NVFP4 variant? Because I trust only you in this matter :)

Yes, I am also waiting for the 27B from @redhatai. Meanwhile I am using this: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP at 120 tk/s+ with 3 MTP tokens!!
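If anyone wants to try the same, the flag change is small (a sketch; assumes your build supports MTP for this model and you have VRAM headroom for the extra draft tokens):

```bash
# Sketch: serve the 27B quant with three MTP draft tokens instead of one.
# More draft tokens cost extra VRAM; untested outside the setup above.
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```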

Me too!
