# Qwen3.6-27B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.6-27B. 55.6 GB → 20.6 GB (0.37x), with the vision tower and MTP draft head preserved in BF16. Tested on NVIDIA DGX Spark (GB10, SM 121).
## NVFP4 Quantization Details

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Quantization | NVFP4 (W4A4 — weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor |
| Calibration | nvidia/Nemotron-Post-Training-Dataset-v2 (512 samples) |
| Container | eugr/spark-vllm-docker |
| Size | 20.6 GB (quantized shard + BF16 MTP shard) |
| Requires | NVIDIA Blackwell GPU (SM 120+), vLLM >= 0.19 |
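W4A4 NVFP4 stores FP4 (E2M1) values in small blocks, each carrying a higher-precision scale (FP8 E4M3 in the real format). A minimal numeric sketch of the weight-side idea, using an illustrative 16-element block and keeping the scale as a plain float for simplicity:

```python
import numpy as np

# The non-negative magnitudes representable in FP4 E2M1 (signs are separate).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block: choose a scale so the largest magnitude maps to
    6.0 (the E2M1 max), then round each value to the nearest representable
    E2M1 magnitude, keeping its sign. (Real NVFP4 also quantizes the scale
    itself to FP8 E4M3; that step is omitted here.)"""
    scale = max(np.abs(block).max() / 6.0, 1e-12)
    idx = np.abs(np.abs(block) / scale - E2M1[:, None]).argmin(axis=0)
    return np.sign(block) * E2M1[idx], scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=16)            # one block of BF16-ish weights
q, s = quantize_block_nvfp4(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs reconstruction error: {err:.5f}")
```

The worst-case error per element is one scale unit (half the 4→6 gap in E2M1), which is why per-block scaling keeps the damage proportional to each block's own magnitude.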
## Recipe

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```
## What's Quantized / What's Not

- **Quantized (NVFP4):** all `Linear` layers in the language model
- **Kept in BF16:** `lm_head`, all vision layers (`model.visual.*`), MLP gates, and the MTP draft head (`mtp.*`)
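The `ignore` list mixes exact module names with `re:`-prefixed regexes, following llm-compressor's convention. A small sketch of how those patterns partition module names (the helper and sample names here are illustrative, not llm-compressor's actual implementation):

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]

def is_ignored(name: str) -> bool:
    """True if a module name should be excluded from NVFP4 quantization."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.match(pat[3:], name):   # regex patterns, matched from the start
                return True
        elif name == pat:                 # everything else is an exact name
            return True
    return False

# Note the trailing $: the MoE router `mlp.gate` is skipped,
# but the projection `mlp.gate_proj` is still quantized.
for name in ["lm_head",
             "model.visual.blocks.0.attn.qkv",
             "model.layers.3.mlp.gate",
             "model.layers.3.mlp.gate_proj"]:
    print(name, "->", "BF16" if is_ignored(name) else "NVFP4")
```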
## MTP Speculative Decoding

The MTP draft head (`mtp.fc`, `mtp.layers.0.*`) is kept in BF16 and shipped as a separate `model-mtp-bf16.safetensors` shard. Quantizing the draft head to FP8/FP4 lowers the acceptance rate; BF16 is the typical choice for Qwen3.5-series NVFP4 checkpoints.
Enable via vLLM:

```shell
--speculative-config.method mtp --speculative-config.num_speculative_tokens 3
```
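Why acceptance rate matters: under the standard speculative-decoding analysis, with `k` draft tokens and a per-token acceptance probability `a`, the expected number of tokens committed per target-model step is `(1 - a^(k+1)) / (1 - a)`. A quick sketch (the acceptance rates below are made-up illustrations, not measurements of this checkpoint):

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens committed per target forward pass with k draft tokens
    and per-token acceptance probability a (geometric model: drafting stops
    at the first rejection, plus one token from the target model itself)."""
    return (1 - a ** (k + 1)) / (1 - a)

# With num_speculative_tokens=3, a modest drop in acceptance rate
# costs a disproportionate share of the speedup:
for a in (0.8, 0.7, 0.6):
    print(f"acceptance {a:.1f}: {expected_tokens_per_step(a, 3):.2f} tokens/step")
```

This is the trade the BF16 draft head buys back: a few GB of extra memory in exchange for keeping `a`, and hence the decoding speedup, high.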
## Evaluation

`humaneval_instruct_chat` (lm-evaluation-harness, 0-shot, `extract_code` filter):
| Model | Metric | Value | Stderr |
|---|---|---|---|
| Qwen3.6-27B (BF16) | pass@1 | 0.9817 | ±0.0105 |
| Qwen3.6-27B-NVFP4 | pass@1 | 0.9695 | ±0.0135 |
Recovery: 98.76% of BF16 pass@1 (0.9695 / 0.9817).
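The recovery figure is just the ratio of the two pass@1 scores; a quick arithmetic check:

```python
bf16_pass1, nvfp4_pass1 = 0.9817, 0.9695
recovery = nvfp4_pass1 / bf16_pass1 * 100
print(f"{recovery:.2f}% of BF16 pass@1")   # 98.76%
```

Note also that the two scores sit well within each other's reported stderr, so the benchmark cannot distinguish the quantized model from the BF16 baseline on this task.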
## Quick Start (vLLM)

```shell
vllm serve ocicek/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
```
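Once serving, the model is reachable through vLLM's OpenAI-compatible API. A minimal stdlib client sketch (host, port, and `max_tokens` below assume vLLM's defaults and are adjustable):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "ocicek/Qwen3.6-27B-NVFP4") -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_chat_request("Write a haiku about FP4.")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server from the command above running:
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
print(payload["model"])
```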
## Tested Environment
| Component | Version |
|---|---|
| vLLM | 0.19.2rc1 |
| Transformers | 5.x |
| PyTorch | 2.11.0+cu130 |
| GPU | NVIDIA DGX Spark (GB10, SM 121) |
## Credits

- Original model: Qwen Team (Alibaba Group)
- Quantization framework: vllm-project/llm-compressor
- DGX Spark vLLM container: eugr/spark-vllm-docker