Qwen3.5-35B-A3B-heretic-v2-NVFP4-W4A4

Calibrated NVFP4 W4A4 quantization of llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated via Heretic).

Both weights and activations are quantized to FP4 (E2M1). Activation scales are calibrated from 256 samples of HuggingFaceH4/ultrachat_200k.

Quantization Details

| Parameter | Value |
|---|---|
| Method | NVFP4 W4A4 (calibrated weight + activation FP4) |
| Weight dtype | E2M1 (4-bit float), packed uint8 |
| Block scale | FP8 E4M3, one per 16 elements |
| Global scale | FP32, one per tensor |
| Activation quantization | FP4 dynamic, with calibrated `input_global_scale` |
| Format | `nvfp4-pack-quantized` (compressed-tensors) |
| Calibration | 256 samples, HuggingFaceH4/ultrachat_200k, seq_len=4096 |
| Original size | ~66 GB (BF16) |
| VRAM at load | 21.88 GiB |
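The scale hierarchy in the table (per-16-element block scales under a per-tensor global scale, with values rounded to the E2M1 grid) can be sketched as follows. This is an illustrative NumPy model, not the actual kernel: a real implementation would also round the block scale itself to FP8 E4M3, which is omitted here for clarity.

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes.
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, global_scale=1.0):
    """Quantize one 16-element block: block scale + round-to-nearest E2M1."""
    amax = np.abs(block).max()
    # Scale so the block's max magnitude lands on the top E2M1 value (6.0).
    # A real kernel also rounds this scale to FP8 E4M3; omitted for clarity.
    scale = (amax / 6.0) / global_scale if amax > 0 else 1.0
    scaled = block / (scale * global_scale)
    # Round each element to the nearest representable magnitude, keeping sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_LEVELS).argmin(axis=1)
    q = np.sign(scaled) * E2M1_LEVELS[idx]
    return q, scale

def dequantize_block(q, scale, global_scale=1.0):
    return q * scale * global_scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = quantize_block(w)
w_hat = dequantize_block(q, s)
# Reconstruction error is bounded by half the widest E2M1 gap times the scale.
assert np.abs(w - w_hat).max() <= s * 1.0 + 1e-6
```

The calibrated `input_global_scale` plays the role of `global_scale` on the activation side: it is fixed from calibration data, while the per-block scale is computed dynamically at inference time.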

What's quantized (FP4)

All MoE expert MLP projections (gate_proj, up_proj, down_proj for the 256 routed experts plus the shared expert) and all attention projections (q_proj, k_proj, v_proj, o_proj).
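The "packed uint8" weight storage means two 4-bit E2M1 codes share each byte. A minimal sketch of the pack/unpack round trip (the low-nibble-first ordering is an assumption for illustration, not a statement about the compressed-tensors layout):

```python
import numpy as np

def pack_fp4(codes):
    """Pack 4-bit codes (0..15) two per byte; low-nibble-first order assumed."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_fp4(packed):
    """Recover the 4-bit codes from the packed byte stream."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = (packed >> 4) & 0x0F
    return out

codes = np.array([0, 1, 7, 15, 8, 3], dtype=np.uint8)
packed = pack_fp4(codes)
assert packed.size == codes.size // 2          # two weights per stored byte
assert np.array_equal(unpack_fp4(packed), codes)
```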

What's kept in BF16

GatedDeltaNet (linear attention) layers, vision tower, norms, embeddings, lm_head, MoE router gates.

Quality (JSD vs FP8)

Evaluated using Jensen-Shannon Divergence with forced-decode on 2,590 tokens across 12 prompts (chat, reasoning, code, creative). Reference model: FP8 block-wise quantization of the same base model.

| Category | Mean JSD | Assessment |
|---|---|---|
| Reasoning | 0.0037 | Excellent |
| Code | 0.0065 | Excellent |
| Chat | 0.0143 | Good |
| Creative | 0.0267 | Good |
| Overall | 0.0103 | Good |

JSD scale: 0 = identical, 1 = completely different. < 0.01 excellent, < 0.05 good.
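For reference, a per-token JSD of this kind is computed between the two models' next-token probability distributions and averaged over forced-decoded tokens. A minimal base-2 implementation (illustrative; not the exact evaluation script used here):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two probability vectors.

    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log2(a + eps) - np.log2(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
print(jsd(p, p))                                         # 0.0 (identical)
print(jsd(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # ~1.0 (disjoint)
```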

Usage with vLLM

```bash
vllm serve nivvis/Qwen3.5-35B-A3B-heretic-v2-NVFP4-W4A4 \
    --max-num-seqs 32 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code
```

SM 120 (RTX PRO 6000, RTX 5090) users

The TRTLLM fused MoE kernel has a hardcoded SM 100 check that fails on SM 120. Set this environment variable to disable it:

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve ...
```

This falls back to the VLLM_CUTLASS MoE backend, which works on SM 120.

Performance

| Metric | Value |
|---|---|
| VRAM | 21.88 GiB |
| Decode throughput | ~130 t/s (single request) |
| Hardware tested | NVIDIA RTX PRO 6000 Blackwell Max-Q (SM 120) |
| MoE backend | VLLM_CUTLASS |
| Dense backend | VLLM_CUTLASS |
| vLLM version | 0.17.0rc1 |
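A back-of-envelope check that the 21.88 GiB load figure is consistent with the quantization scheme: 4-bit weights plus one FP8 scale per 16 elements cost 4.5 bits per parameter, with a small BF16 remainder for the unquantized tensors. The parameter count and BF16 fraction below are rough assumptions for illustration.

```python
# Rough VRAM estimate for ~35B params at NVFP4 (all inputs approximate).
params = 35e9                 # total parameter count (approximate)
bits_weight = 4               # E2M1 payload per element
bits_scale = 8 / 16           # one FP8 E4M3 scale per 16-element block
bf16_fraction = 0.05          # share of params kept in BF16 (rough assumption)

fp4_bytes = params * (1 - bf16_fraction) * (bits_weight + bits_scale) / 8
bf16_bytes = params * bf16_fraction * 2
total_gib = (fp4_bytes + bf16_bytes) / 2**30
print(round(total_gib, 1))    # lands in the low 20s GiB, near the 21.88 GiB measured
```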

Known Issues

  • SM 120 TRTLLM MoE kernel: requires VLLM_USE_FLASHINFER_MOE_FP4=0 (see above).
  • No MTP: Heretic abliteration strips the MTP weights, so do not enable speculative-decoding flags. Distilling an MTP head from the original Qwen model is a planned future improvement.
  • Thinking model: chain-of-thought is very verbose. Rein it in with a system prompt, or set a generous max_tokens.

Quantization Tooling

Calibrated with llm-compressor PR #2383 (Qwen3.5 support, not yet merged). Post-processing was required to make the checkpoint loadable by vLLM: config grafting, tensor merging, and ignore-list prefix correction.

Credits
