# Qwen3.6-27B-NVFP4

An NVFP4-quantized version of Qwen/Qwen3.6-27B.

55.6 GB → 20.6 GB (0.37× the original size), with the vision tower and MTP draft head preserved in BF16. Tested on NVIDIA DGX Spark (GB10, SM 121).
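The size reduction quoted above is easy to sanity-check; a quick sketch using only the sizes stated on this card (the per-component split between quantized and BF16 shards is not published here):

```python
# Sanity-check the size reduction quoted above.
bf16_gb = 55.6   # original BF16 checkpoint size (from this card)
nvfp4_gb = 20.6  # quantized checkpoint size (from this card)

ratio = nvfp4_gb / bf16_gb       # fraction of the original size
savings_gb = bf16_gb - nvfp4_gb  # absolute disk/VRAM savings

print(f"{ratio:.2f}x of original size, {savings_gb:.1f} GB saved")
# -> 0.37x of original size, 35.0 GB saved
```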

## NVFP4 Quantization Details

| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Quantization | NVFP4 (W4A4 — FP4 weights, FP4 activations, FP8 scales) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor |
| Calibration | nvidia/Nemotron-Post-Training-Dataset-v2 (512 samples) |
| Container | eugr/spark-vllm-docker |
| Size | 20.6 GB (quantized shard + BF16 MTP shard) |
| Requires | NVIDIA Blackwell GPU (SM 120+), vLLM >= 0.19 |

## Recipe

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```

## What's Quantized / What's Not

- **Quantized (NVFP4):** all `Linear` layers in the language model
- **Kept in BF16:** `lm_head`, all vision layers (`model.visual.*`), the MLP gates, and the MTP draft head (`mtp.*`)
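This split is driven by the `ignore` list in the recipe. A minimal sketch of how those patterns partition layer names — the layer names below are illustrative, and the `re:`-prefixed entries are assumed to match like `re.search`; llm-compressor's exact matching semantics may differ:

```python
import re

# Ignore list from the recipe: "re:"-prefixed entries are regexes, others exact names.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]

def is_ignored(name: str) -> bool:
    """Return True if a layer name stays in BF16 rather than being quantized."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.search(pat[3:], name):
                return True
        elif name == pat:
            return True
    return False

# Illustrative (hypothetical) layer names:
print(is_ignored("lm_head"))                                 # True  -> BF16
print(is_ignored("model.visual.blocks.0.attn.qkv"))          # True  -> BF16
print(is_ignored("model.layers.3.mlp.gate"))                 # True  -> BF16
print(is_ignored("model.layers.3.mlp.experts.7.gate_proj"))  # False -> NVFP4
print(is_ignored("model.layers.3.self_attn.q_proj"))         # False -> NVFP4
```

Note how the `$` anchor in `re:.*mlp.gate$` excludes the routing gate itself while still quantizing expert `gate_proj` layers.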

## MTP Speculative Decoding

The MTP draft head (`mtp.fc`, `mtp.layers.0.*`) is kept in BF16 and shipped as a separate `model-mtp-bf16.safetensors` shard. Quantizing the draft head to FP8/FP4 lowers the acceptance rate; BF16 is the typical choice for Qwen3.5-series NVFP4 checkpoints.

Enable via vLLM (`--speculative-config` takes a JSON string):

```shell
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```
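Why the acceptance rate matters: under the common simplifying assumption that each drafted token is accepted independently with probability `a`, a draft of `k` tokens yields on average `1 + a + a² + … + a^k` tokens per target-model forward pass. A sketch (the acceptance values below are illustrative, not measured for this checkpoint):

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, with k speculative
    tokens and per-token acceptance probability a (independence assumption)."""
    return sum(a**i for i in range(k + 1))

# Illustrative acceptance rates (not measured):
for a in (0.6, 0.7, 0.8):
    print(f"a={a}: {expected_tokens_per_step(a, 3):.2f} tokens/step")
```

Even a modest drop in acceptance rate compounds across the whole draft, which is why the draft head is left in BF16.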

## Evaluation

`humaneval_instruct_chat` (lm-evaluation-harness, 0-shot, `extract_code` filter):

| Model | Metric | Value | Stderr |
| --- | --- | --- | --- |
| Qwen3.6-27B (BF16) | pass@1 | 0.9817 | ±0.0105 |
| Qwen3.6-27B-NVFP4 | pass@1 | 0.9695 | ±0.0135 |

Recovery: 98.76% of BF16 pass@1 (0.9695 / 0.9817).
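The recovery figure, plus a check that the BF16→NVFP4 gap sits inside the reported error bars (standard error propagation, assuming the two runs are independent):

```python
bf16, bf16_se = 0.9817, 0.0105
nvfp4, nvfp4_se = 0.9695, 0.0135

recovery = nvfp4 / bf16 * 100
print(f"recovery: {recovery:.2f}%")  # -> recovery: 98.76%

# Combined standard error of the difference (independence assumption):
combined_se = (bf16_se**2 + nvfp4_se**2) ** 0.5
gap = bf16 - nvfp4
# gap (~0.0122) is smaller than one combined stderr (~0.0171),
# i.e. the difference is within the measurement noise of this benchmark.
print(f"gap {gap:.4f} vs combined stderr {combined_se:.4f}")
```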

## Quick Start (vLLM)

```shell
vllm serve ocicek/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
```
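The server exposes vLLM's OpenAI-compatible API (at `http://localhost:8000/v1` by default). A sketch of the chat-completions request body it expects — built by hand here so no running server is needed; the prompt and sampling parameters are illustrative:

```python
import json

# Request body for POST /v1/chat/completions on the server started above.
payload = {
    "model": "ocicek/Qwen3.6-27B-NVFP4",
    "messages": [
        {"role": "user", "content": "Write a Python one-liner that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

With the server running, send it with any HTTP or OpenAI client, e.g. `curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d "$BODY"`.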

## Tested Environment

| Component | Version |
| --- | --- |
| vLLM | 0.19.2rc1 |
| Transformers | 5.x |
| PyTorch | 2.11.0+cu130 |
| GPU | NVIDIA DGX Spark (GB10, SM 121) |
