# Qwen3.5-122B-A10B-heretic-MTP-NVFP4

NVFP4 (W4A4) quantization of `trohrbaugh/Qwen3.5-122B-A10B-heretic`.
- Base Model: Qwen3.5-122B-A10B-heretic (MoE: 122B total, ~10B active, abliterated with KL ~0.09)
- Quantization: NVFP4 (W4A4) — weights and activations
- Size: 76GB (16 shards + MTP shard + visual shard)
- Quantized: Language backbone MoE expert and attention layers
- NOT Quantized: Vision encoder, merger, LM head, embed tokens, linear attention, MoE gates, MTP heads (remain BF16)
- MTP: Working speculative decoding — 785 tensors spliced from base Qwen/Qwen3.5-122B-A10B in BF16
- Tokenizer: From base Qwen3.5
## Usage with vLLM
Tested on vLLM 0.19+.
```bash
vllm serve OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --reasoning-config '{}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":6}'
```
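Once the server is up, any OpenAI-compatible client works. A minimal stdlib-only sketch (the host, port, and `build_payload` helper are assumptions, not part of this repo):

```python
import json
import urllib.request

MODEL = "OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4"

def build_payload(prompt, max_tokens=512):
    """Assemble an OpenAI-compatible chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8000/v1"):
    """POST the payload to the vLLM server's /chat/completions endpoint."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```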
## MTP Throughput (2x RTX 6000 Pro Blackwell, TP=2)
| MTP Speculative Tokens | tok/s |
|---|---|
| 0 (disabled) | ~105 |
| 1 | ~115 |
| 2 | ~145 |
| 3 | ~170 |
| 6 | ~190 |
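In relative terms, 6 speculative tokens gives roughly a 1.8x decode speedup over MTP disabled. A quick sanity check on the table's numbers:

```python
# Throughput (tok/s) by speculative-token count, copied from the table above.
throughput = {0: 105, 1: 115, 2: 145, 3: 170, 6: 190}

baseline = throughput[0]
# Speedup of each setting relative to MTP disabled.
speedup = {k: round(v / baseline, 2) for k, v in throughput.items()}
```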
## Quantization
Quantized with llm-compressor (compressed-tensors v0.14.1.dev28).
- Format: `nvfp4-pack-quantized`
- Weight/activation bits: FP4 E2M1
- Scale dtype: `float8_e4m3fn`
- Group size: 16
- Calibration: 512 samples (256 UltraChat + 256 Nemotron-CC chat split)
- MoE calibration: `moe_calibrate_all_experts=True`
- Ignore list: aligned with RedHatAI/Qwen3.5-122B-A10B-NVFP4
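For orientation, a hedged sketch of what such an llm-compressor run looks like. The exact ignore patterns, dataset wiring, and sequence length below are assumptions; the reference ignore list is the one in RedHatAI's NVFP4 recipe:

```python
# Sketch only, not the exact recipe used for this repo.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",  # FP4 E2M1 weights/activations, FP8 scales, group size 16
    # Assumed ignore patterns, mirroring the "NOT quantized" list above.
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mtp.*"],
)

oneshot(
    model="trohrbaugh/Qwen3.5-122B-A10B-heretic",
    dataset="ultrachat_200k",  # the real run also mixed in a Nemotron-CC chat split
    recipe=recipe,
    num_calibration_samples=512,
    max_seq_length=2048,
)
```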
## MTP Splicing
The upstream heretic quant reduced the MTP heads to FP4, which breaks speculative decoding. All 785 `mtp.*` tensors were instead extracted from the base Qwen/Qwen3.5-122B-A10B in BF16 and saved as a separate `model_mtp.safetensors` shard (4.8GB).
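The splice itself is just key filtering on the state dict. A minimal sketch of the logic with plain dicts (the real script would load and save shards with `safetensors`, e.g. `load_file`/`save_file` from `safetensors.torch`):

```python
def split_mtp(state_dict):
    """Separate mtp.* tensors (kept in BF16) from the rest of the checkpoint."""
    mtp = {k: v for k, v in state_dict.items() if k.startswith("mtp.")}
    rest = {k: v for k, v in state_dict.items() if not k.startswith("mtp.")}
    return mtp, rest
```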
## Notes for Reproducing
- MoE unfusing roughly doubles memory per layer during calibration. GPU calibration will likely OOM. CPU-only calibration with ~400GB swap worked (peak ~445GB RSS+swap, ~3 days on 64 cores).
- llm-compressor's `use_auth_token` kwarg crashes on transformers 5.x; patch it to `token`.
- `offload_folder` is required even without disk offload (transformers reformats fused expert tensors during loading).
- Calling `ctypes.CDLL("libc.so.6").mallopt(-1, 0)` before loading prevents glibc malloc arena bloat.
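The `mallopt` workaround, wrapped so it degrades gracefully off glibc (the helper name is ours; `-1` is glibc's `M_TRIM_THRESHOLD`, and setting it to 0 makes freed memory get returned to the OS promptly during the long CPU-only calibration):

```python
import ctypes

def limit_malloc_bloat():
    """Call glibc mallopt(M_TRIM_THRESHOLD, 0); returns True on success."""
    try:
        libc = ctypes.CDLL("libc.so.6")
        return bool(libc.mallopt(-1, 0))  # mallopt returns 1 on success
    except OSError:
        return False  # not glibc (e.g. macOS); workaround not applicable
```

Call `limit_malloc_bloat()` before `from_pretrained` so every subsequent allocation is affected.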
## Acknowledgments
- Qwen Team for Qwen3.5-122B-A10B
- trohrbaugh for the heretic abliteration
- RedHatAI for the reference NVFP4 recipe
- vLLM and llm-compressor teams