Qwen3.5-122B-A10B-heretic-MTP-NVFP4

NVFP4 (W4A4) quantization of trohrbaugh/Qwen3.5-122B-A10B-heretic.

  • Base Model: Qwen3.5-122B-A10B-heretic (MoE: 122B total, ~10B active, abliterated with KL ~0.09)
  • Quantization: NVFP4 (W4A4) — weights and activations
  • Size: 76GB (16 shards + MTP shard + visual shard)
  • Quantized: Language backbone MoE expert and attention layers
  • NOT Quantized: Vision encoder, merger, LM head, embed tokens, linear attention, MoE gates, MTP heads (remain BF16)
  • MTP: Working speculative decoding — 785 tensors spliced from base Qwen/Qwen3.5-122B-A10B in BF16
  • Tokenizer: From base Qwen3.5
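As a rough illustration of the quantized/BF16 split above, a name-based check could look like the sketch below. The patterns are assumptions inferred from the bullet list, not read from the model's shipped quantization config:

```python
# Illustrative only: module-name patterns assumed from the list above,
# not taken from the model's actual quantization_config ignore list.
BF16_PATTERNS = (
    "visual.",      # vision encoder + merger
    "lm_head",      # LM head
    "embed_tokens", # input embeddings
    "mtp.",         # MTP heads stay BF16 for speculative decoding
    "mlp.gate.",    # MoE router gates (not the experts' gate_proj)
)

def is_quantized(tensor_name: str) -> bool:
    """True if a tensor would be NVFP4-quantized under these assumed rules."""
    return not any(p in tensor_name for p in BF16_PATTERNS)

# Expert projections are quantized; router gates and MTP heads are not.
print(is_quantized("model.layers.0.mlp.experts.3.down_proj.weight"))  # True
print(is_quantized("model.layers.0.mlp.gate.weight"))                 # False
```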

Usage with vLLM

Tested on vLLM 0.19+.

vllm serve OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --reasoning-config '{}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":6}'

MTP Throughput (2x RTX 6000 Pro Blackwell, TP=2)

  Speculative tokens    Throughput (tok/s)
  0 (disabled)          ~105
  1                     ~115
  2                     ~145
  3                     ~170
  6                     ~190
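The relative speedup over the non-speculative baseline is easy to read off the table; a quick sketch:

```python
# Throughput figures copied from the table above (tok/s).
throughput = {0: 105, 1: 115, 2: 145, 3: 170, 6: 190}

baseline = throughput[0]
for k, tps in throughput.items():
    print(f"{k} speculative tokens: {tps / baseline:.2f}x")
# 6 speculative tokens gives ~1.81x over the baseline.
```

Note the diminishing returns: going from 3 to 6 speculative tokens adds only about 12% on top of the ~62% gain already reached at 3.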

Quantization

Quantized with llm-compressor (compressed-tensors v0.14.1.dev28).

  • Format: nvfp4-pack-quantized
  • Weight/Activation bits: FP4 E2M1
  • Scale dtype: float8_e4m3fn
  • Group size: 16
  • Calibration: 512 samples (256 UltraChat + 256 Nemotron-CC chat split)
  • MoE calibration: moe_calibrate_all_experts=True
  • Ignore list: Aligned with RedHatAI/Qwen3.5-122B-A10B-NVFP4
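FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1) can represent only 15 distinct values, which is why per-group FP8 scales with a small group size matter so much. A small sketch enumerating the full code space:

```python
# Enumerate every value representable by FP4 E2M1 (exponent bias = 1).
def e2m1_values():
    vals = set()
    for sign in (1, -1):
        for exp in range(4):          # 2 exponent bits
            for man in range(2):      # 1 mantissa bit
                if exp == 0:          # subnormal: 0.m * 2^(1 - bias)
                    mag = man * 0.5
                else:                 # normal: 1.m * 2^(exp - bias)
                    mag = (1 + man * 0.5) * 2 ** (exp - 1)
                vals.add(sign * mag)
    return sorted(vals)

print(e2m1_values())
# Positive magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
```

Every weight in a group of 16 gets snapped to one of these values after dividing by the group's FP8 scale.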

MTP Splicing

The upstream heretic quantization also quantized the MTP heads to FP4, which breaks speculative decoding. To restore it, all 785 mtp.* tensors were extracted in BF16 from the base Qwen/Qwen3.5-122B-A10B and saved as a separate model_mtp.safetensors shard (4.8GB).

Notes for Reproducing

  • MoE unfusing roughly doubles memory per layer during calibration. GPU calibration will likely OOM. CPU-only calibration with ~400GB swap worked (peak ~445GB RSS+swap, ~3 days on 64 cores).
  • llm-compressor's use_auth_token kwarg crashes on transformers 5.x — patch to token.
  • offload_folder is required even without disk offload (transformers reformats fused expert tensors during loading).
  • ctypes.CDLL("libc.so.6").mallopt(-1, 0) before loading prevents glibc malloc arena bloat.
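The mallopt call from the last bullet, spelled out (Linux/glibc only): -1 is glibc's M_TRIM_THRESHOLD, and setting it to 0 makes free() return memory to the OS immediately rather than letting the heap grow during the long calibration run.

```python
import ctypes

# Must run before large allocations begin (i.e. before model loading).
# mallopt(M_TRIM_THRESHOLD=-1, 0) forces aggressive heap trimming so
# freed memory is handed back to the OS instead of retained by malloc.
libc = ctypes.CDLL("libc.so.6")
ret = libc.mallopt(-1, 0)
print("mallopt ok:", ret == 1)  # glibc's mallopt returns 1 on success
```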
