mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4

MTP and calibration dataset

by jmckenzie-dev - opened Mar 18

Mar 18

First off - really appreciate you making this quant available. Really great model and great quant.

That said it looks like mtp layers aren't present in this quant. How much of a PITA would it be to update the model w/the MTP layers? (Sehyo went through the same nvfp4 quanting base Qwen3.5 family).

Alternatively, could you make the calibration dataset you used available so people can reproduce your quant w/those layers intact?

mconcat

Owner Mar 18

I have updated the model to have the MTP head. The MTP heads are not quantized - they take only like half gb ish and would be better for performance if stayed in BF16.

This command worked for me locally:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve /mnt/nas10g/models/Qwen3.5-27B-Distilled-NVFP4-Mixed-v2 \
    --port 8001 \
    --reasoning-parser qwen3 \
    --max-model-len 8192 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

jmckenzie-dev

Mar 18

Awesome! I'll pull it back down and try it out. Hoping nvfp4 can beat QuantTrio's AWQ but not sure the nvfp4 kernel will do it. /sad

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment