MTP and calibration dataset
First off - really appreciate you making this quant available. Really great model and great quant.
That said it looks like mtp layers aren't present in this quant. How much of a PITA would it be to update the model w/the MTP layers? (Sehyo went through the same nvfp4 quanting base Qwen3.5 family).
Alternatively, could you make the calibration dataset you used available so people can reproduce your quant w/those layers intact?
I have updated the model to have the MTP head. The MTP heads are not quantized - they take only like half gb ish and would be better for performance if stayed in BF16.
This command worked for me locally:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve /mnt/nas10g/models/Qwen3.5-27B-Distilled-NVFP4-Mixed-v2 \
--port 8001 \
--reasoning-parser qwen3 \
--max-model-len 8192 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
Awesome! I'll pull it back down and try it out. Hoping nvfp4 can beat QuantTrio's AWQ but not sure the nvfp4 kernel will do it. /sad