# Qwen3.6-27B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.6-27B. 55.6 GB → 20.6 GB (0.37x), with the vision tower and MTP draft head preserved in BF16. Tested on NVIDIA DGX Spark (GB10, SM 121).
## NVFP4 Quantization Details

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Quantization | NVFP4 (W4A4 — weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor |
| Calibration | nvidia/Nemotron-Post-Training-Dataset-v2 (512 samples) |
| Container | eugr/spark-vllm-docker |
| Size | 20.6 GB (quantized shard + BF16 MTP shard) |
| Requires | NVIDIA Blackwell GPU (SM 120+), vLLM >= 0.19 |
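W4A4 NVFP4 stores FP4 (E2M1) values in small blocks, each carrying a higher-precision scale (FP8 E4M3 in the real format). A minimal numeric sketch of the weight-side idea, using an illustrative 16-element block and keeping the scale as a plain float for simplicity:

```python
import numpy as np

# The non-negative magnitudes representable in FP4 E2M1 (signs are separate).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block: choose a scale so the largest magnitude maps to
    6.0 (the E2M1 max), then round each value to the nearest representable
    E2M1 magnitude, keeping its sign. (Real NVFP4 also quantizes the scale
    itself to FP8 E4M3; that step is omitted here.)"""
    scale = max(np.abs(block).max() / 6.0, 1e-12)
    idx = np.abs(np.abs(block) / scale - E2M1[:, None]).argmin(axis=0)
    return np.sign(block) * E2M1[idx], scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=16)            # one block of BF16-ish weights
q, s = quantize_block_nvfp4(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs reconstruction error: {err:.5f}")
```

The worst-case error per element is one scale unit (half the 4→6 gap in E2M1), which is why per-block scaling keeps the damage proportional to each block's own magnitude.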
## Recipe

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```
## What's Quantized / What's Not

- **Quantized (NVFP4):** all `Linear` layers in the language model
- **Kept in BF16:** `lm_head`, all vision layers (`model.visual.*`), MLP gates, and the MTP draft head (`mtp.*`)
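The `ignore` list mixes exact module names with `re:`-prefixed regexes, following llm-compressor's convention. A small sketch of how those patterns partition module names (the helper and sample names here are illustrative, not llm-compressor's actual implementation):

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]

def is_ignored(name: str) -> bool:
    """True if a module name should be excluded from NVFP4 quantization."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.match(pat[3:], name):   # regex patterns, matched from the start
                return True
        elif name == pat:                 # everything else is an exact name
            return True
    return False

# Note the trailing $: the MoE router `mlp.gate` is skipped,
# but the projection `mlp.gate_proj` is still quantized.
for name in ["lm_head",
             "model.visual.blocks.0.attn.qkv",
             "model.layers.3.mlp.gate",
             "model.layers.3.mlp.gate_proj"]:
    print(name, "->", "BF16" if is_ignored(name) else "NVFP4")
```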
## MTP Speculative Decoding

The MTP draft head (`mtp.fc`, `mtp.layers.0.*`) is kept in BF16 and shipped as a separate `model-mtp-bf16.safetensors` shard. Quantizing the draft head to FP8/FP4 lowers the acceptance rate; BF16 is the typical choice for Qwen3.5-series NVFP4 checkpoints.
Enable via vLLM:

```shell
--speculative-config.method mtp --speculative-config.num_speculative_tokens 3
```
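Why acceptance rate matters: under the standard speculative-decoding analysis, with `k` draft tokens and a per-token acceptance probability `a`, the expected number of tokens committed per target-model step is `(1 - a^(k+1)) / (1 - a)`. A quick sketch (the acceptance rates below are made-up illustrations, not measurements of this checkpoint):

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens committed per target forward pass with k draft tokens
    and per-token acceptance probability a (geometric model: drafting stops
    at the first rejection, plus one token from the target model itself)."""
    return (1 - a ** (k + 1)) / (1 - a)

# With num_speculative_tokens=3, a modest drop in acceptance rate
# costs a disproportionate share of the speedup:
for a in (0.8, 0.7, 0.6):
    print(f"acceptance {a:.1f}: {expected_tokens_per_step(a, 3):.2f} tokens/step")
```

This is the trade the BF16 draft head buys back: a few GB of extra memory in exchange for keeping `a`, and hence the decoding speedup, high.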
## Evaluation

`humaneval_instruct_chat` (lm-evaluation-harness, 0-shot, `extract_code` filter):
| Model | Metric | Value | Stderr |
|---|---|---|---|
| Qwen3.6-27B (BF16) | pass@1 | 0.9817 | ±0.0105 |
| Qwen3.6-27B-NVFP4 | pass@1 | 0.9695 | ±0.0135 |
Recovery: 98.76% of BF16 pass@1 (0.9695 / 0.9817).
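The recovery figure is just the ratio of the two pass@1 scores; a quick arithmetic check:

```python
bf16_pass1, nvfp4_pass1 = 0.9817, 0.9695
recovery = nvfp4_pass1 / bf16_pass1 * 100
print(f"{recovery:.2f}% of BF16 pass@1")   # 98.76%
```

Note also that the two scores sit well within each other's reported stderr, so the benchmark cannot distinguish the quantized model from the BF16 baseline on this task.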
## Quick Start (vLLM)

```shell
vllm serve ocicek/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
```
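Once serving, the model is reachable through vLLM's OpenAI-compatible API. A minimal stdlib client sketch (host, port, and `max_tokens` below assume vLLM's defaults and are adjustable):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "ocicek/Qwen3.6-27B-NVFP4") -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_chat_request("Write a haiku about FP4.")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server from the command above running:
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
print(payload["model"])
```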
## Tested Environment
| Component | Version |
|---|---|
| vLLM | 0.19.2rc1 |
| Transformers | 5.x |
| PyTorch | 2.11.0+cu130 |
| GPU | NVIDIA DGX Spark (GB10, SM 121) |
## Credits

- Original model: Qwen Team (Alibaba Group)
- Quantization framework: vllm-project/llm-compressor
- DGX Spark vLLM container: eugr/spark-vllm-docker