Qwen3.5-122B-A10B-heretic-MTP-NVFP4

NVFP4 (W4A4) quantization of trohrbaugh/Qwen3.5-122B-A10B-heretic.

  • Base Model: Qwen3.5-122B-A10B-heretic (MoE: 122B total, ~10B active, abliterated with KL ~0.09)
  • Quantization: NVFP4 (W4A4) — weights and activations
  • Size: 76GB (16 shards + MTP shard + visual shard)
  • Quantized: Language backbone MoE expert and attention layers
  • NOT Quantized: Vision encoder, merger, LM head, embed tokens, linear attention, MoE gates, MTP heads (remain BF16)
  • MTP: Working speculative decoding — 785 tensors spliced from base Qwen/Qwen3.5-122B-A10B in BF16
  • Tokenizer: From base Qwen3.5
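As a rough illustration of the quantized/BF16 split above, a name-based check could look like the sketch below. The patterns are assumptions inferred from the bullet list, not read from the model's shipped quantization config:

```python
# Illustrative only: module-name patterns assumed from the list above,
# not taken from the model's actual quantization_config ignore list.
BF16_PATTERNS = (
    "visual.",      # vision encoder + merger
    "lm_head",      # LM head
    "embed_tokens", # input embeddings
    "mtp.",         # MTP heads stay BF16 for speculative decoding
    "mlp.gate.",    # MoE router gates (not the experts' gate_proj)
)

def is_quantized(tensor_name: str) -> bool:
    """True if a tensor would be NVFP4-quantized under these assumed rules."""
    return not any(p in tensor_name for p in BF16_PATTERNS)

# Expert projections are quantized; router gates and MTP heads are not.
print(is_quantized("model.layers.0.mlp.experts.3.down_proj.weight"))  # True
print(is_quantized("model.layers.0.mlp.gate.weight"))                 # False
```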

Usage with vLLM

Tested on vLLM 0.19+.

vllm serve OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --reasoning-config '{}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":6}'

MTP Throughput (2x RTX 6000 Pro Blackwell, TP=2)

  Speculative tokens    Throughput (tok/s)
  0 (disabled)          ~105
  1                     ~115
  2                     ~145
  3                     ~170
  6                     ~190
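The relative speedup over the non-speculative baseline is easy to read off the table; a quick sketch:

```python
# Throughput figures copied from the table above (tok/s).
throughput = {0: 105, 1: 115, 2: 145, 3: 170, 6: 190}

baseline = throughput[0]
for k, tps in throughput.items():
    print(f"{k} speculative tokens: {tps / baseline:.2f}x")
# 6 speculative tokens gives ~1.81x over the baseline.
```

Note the diminishing returns: going from 3 to 6 speculative tokens adds only about 12% on top of the ~62% gain already reached at 3.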

Quantization

Quantized with llm-compressor (compressed-tensors v0.14.1.dev28).

  • Format: nvfp4-pack-quantized
  • Weight/Activation bits: FP4 E2M1
  • Scale dtype: float8_e4m3fn
  • Group size: 16
  • Calibration: 512 samples (256 UltraChat + 256 Nemotron-CC chat split)
  • MoE calibration: moe_calibrate_all_experts=True
  • Ignore list: Aligned with RedHatAI/Qwen3.5-122B-A10B-NVFP4
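FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1) can represent only 15 distinct values, which is why per-group FP8 scales with a small group size matter so much. A small sketch enumerating the full code space:

```python
# Enumerate every value representable by FP4 E2M1 (exponent bias = 1).
def e2m1_values():
    vals = set()
    for sign in (1, -1):
        for exp in range(4):          # 2 exponent bits
            for man in range(2):      # 1 mantissa bit
                if exp == 0:          # subnormal: 0.m * 2^(1 - bias)
                    mag = man * 0.5
                else:                 # normal: 1.m * 2^(exp - bias)
                    mag = (1 + man * 0.5) * 2 ** (exp - 1)
                vals.add(sign * mag)
    return sorted(vals)

print(e2m1_values())
# Positive magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
```

Every weight in a group of 16 gets snapped to one of these values after dividing by the group's FP8 scale.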

MTP Splicing

The upstream heretic quantization also quantized the MTP heads to FP4, which breaks speculative decoding. To restore it, all 785 mtp.* tensors were extracted in BF16 from the base Qwen/Qwen3.5-122B-A10B and saved as a separate model_mtp.safetensors shard (4.8GB).

Notes for Reproducing

  • MoE unfusing roughly doubles memory per layer during calibration. GPU calibration will likely OOM. CPU-only calibration with ~400GB swap worked (peak ~445GB RSS+swap, ~3 days on 64 cores).
  • llm-compressor's use_auth_token kwarg crashes on transformers 5.x — patch to token.
  • offload_folder is required even without disk offload (transformers reformats fused expert tensors during loading).
  • ctypes.CDLL("libc.so.6").mallopt(-1, 0) before loading prevents glibc malloc arena bloat.
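The mallopt call from the last bullet, spelled out (Linux/glibc only): -1 is glibc's M_TRIM_THRESHOLD, and setting it to 0 makes free() return memory to the OS immediately rather than letting the heap grow during the long calibration run.

```python
import ctypes

# Must run before large allocations begin (i.e. before model loading).
# mallopt(M_TRIM_THRESHOLD=-1, 0) forces aggressive heap trimming so
# freed memory is handed back to the OS instead of retained by malloc.
libc = ctypes.CDLL("libc.so.6")
ret = libc.mallopt(-1, 0)
print("mallopt ok:", ret == 1)  # glibc's mallopt returns 1 on success
```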
