Qwen3.5-35B-A3B-heretic-v2-NVFP4-W4A4

Calibrated NVFP4 W4A4 quantization of llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated via Heretic).

Both weights and activations are quantized to FP4 (E2M1). Activation scales are calibrated from 256 samples of HuggingFaceH4/ultrachat_200k.

Quantization Details

| Parameter | Value |
|---|---|
| Method | NVFP4 W4A4 (calibrated weight + activation FP4) |
| Weight dtype | E2M1 (4-bit float), packed uint8 |
| Block scale | FP8 E4M3, one per 16 elements |
| Global scale | FP32, one per tensor |
| Activation quantization | FP4 dynamic, with calibrated `input_global_scale` |
| Format | `nvfp4-pack-quantized` (compressed-tensors) |
| Calibration | 256 samples, HuggingFaceH4/ultrachat_200k, seq_len=4096 |
| Original size | ~66 GB (BF16) |
| VRAM at load | 21.88 GiB |
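The scale hierarchy in the table (per-16-element block scales under a per-tensor global scale, with values rounded to the E2M1 grid) can be sketched as follows. This is an illustrative NumPy model, not the actual kernel: a real implementation would also round the block scale itself to FP8 E4M3, which is omitted here for clarity.

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes.
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, global_scale=1.0):
    """Quantize one 16-element block: block scale + round-to-nearest E2M1."""
    amax = np.abs(block).max()
    # Scale so the block's max magnitude lands on the top E2M1 value (6.0).
    # A real kernel also rounds this scale to FP8 E4M3; omitted for clarity.
    scale = (amax / 6.0) / global_scale if amax > 0 else 1.0
    scaled = block / (scale * global_scale)
    # Round each element to the nearest representable magnitude, keeping sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_LEVELS).argmin(axis=1)
    q = np.sign(scaled) * E2M1_LEVELS[idx]
    return q, scale

def dequantize_block(q, scale, global_scale=1.0):
    return q * scale * global_scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = quantize_block(w)
w_hat = dequantize_block(q, s)
# Reconstruction error is bounded by half the widest E2M1 gap times the scale.
assert np.abs(w - w_hat).max() <= s * 1.0 + 1e-6
```

The calibrated `input_global_scale` plays the role of `global_scale` on the activation side: it is fixed from calibration data, while the per-block scale is computed dynamically at inference time.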

What's quantized (FP4)

All MoE expert MLP projections (gate_proj, up_proj, down_proj for the 256 routed experts plus the shared expert) and all attention projections (q_proj, k_proj, v_proj, o_proj).
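The "packed uint8" weight storage means two 4-bit E2M1 codes share each byte. A minimal sketch of the pack/unpack round trip (the low-nibble-first ordering is an assumption for illustration, not a statement about the compressed-tensors layout):

```python
import numpy as np

def pack_fp4(codes):
    """Pack 4-bit codes (0..15) two per byte; low-nibble-first order assumed."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_fp4(packed):
    """Recover the 4-bit codes from the packed byte stream."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = (packed >> 4) & 0x0F
    return out

codes = np.array([0, 1, 7, 15, 8, 3], dtype=np.uint8)
packed = pack_fp4(codes)
assert packed.size == codes.size // 2          # two weights per stored byte
assert np.array_equal(unpack_fp4(packed), codes)
```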

What's kept in BF16

GatedDeltaNet (linear attention) layers, vision tower, norms, embeddings, lm_head, MoE router gates.

Quality (JSD vs FP8)

Evaluated using Jensen-Shannon Divergence with forced-decode on 2,590 tokens across 12 prompts (chat, reasoning, code, creative). Reference model: FP8 block-wise quantization of the same base model.

| Category | Mean JSD | Assessment |
|---|---|---|
| Reasoning | 0.0037 | Excellent |
| Code | 0.0065 | Excellent |
| Chat | 0.0143 | Good |
| Creative | 0.0267 | Good |
| Overall | 0.0103 | Good |

JSD scale: 0 = identical, 1 = completely different. < 0.01 excellent, < 0.05 good.
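For reference, a per-token JSD of this kind is computed between the two models' next-token probability distributions and averaged over forced-decoded tokens. A minimal base-2 implementation (illustrative; not the exact evaluation script used here):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two probability vectors.

    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log2(a + eps) - np.log2(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
print(jsd(p, p))                                         # 0.0 (identical)
print(jsd(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # ~1.0 (disjoint)
```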

Usage with vLLM

```bash
vllm serve nivvis/Qwen3.5-35B-A3B-heretic-v2-NVFP4-W4A4 \
    --max-num-seqs 32 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code
```

SM 120 (RTX PRO 6000, RTX 5090) users

The TRTLLM fused MoE kernel has a hardcoded SM 100 check that fails on SM 120. Set this environment variable to disable it:

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve ...
```

This falls back to the VLLM_CUTLASS MoE backend, which works on SM 120.

Performance

| Metric | Value |
|---|---|
| VRAM | 21.88 GiB |
| Decode throughput | ~130 t/s (single request) |
| Hardware tested | NVIDIA RTX PRO 6000 Blackwell Max-Q (SM 120) |
| MoE backend | VLLM_CUTLASS |
| Dense backend | VLLM_CUTLASS |
| vLLM version | 0.17.0rc1 |
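A back-of-envelope check that the 21.88 GiB load figure is consistent with the quantization scheme: 4-bit weights plus one FP8 scale per 16 elements cost 4.5 bits per parameter, with a small BF16 remainder for the unquantized tensors. The parameter count and BF16 fraction below are rough assumptions for illustration.

```python
# Rough VRAM estimate for ~35B params at NVFP4 (all inputs approximate).
params = 35e9                 # total parameter count (approximate)
bits_weight = 4               # E2M1 payload per element
bits_scale = 8 / 16           # one FP8 E4M3 scale per 16-element block
bf16_fraction = 0.05          # share of params kept in BF16 (rough assumption)

fp4_bytes = params * (1 - bf16_fraction) * (bits_weight + bits_scale) / 8
bf16_bytes = params * bf16_fraction * 2
total_gib = (fp4_bytes + bf16_bytes) / 2**30
print(round(total_gib, 1))    # lands in the low 20s GiB, near the 21.88 GiB measured
```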

Known Issues

  • SM 120 TRTLLM MoE kernel: requires VLLM_USE_FLASHINFER_MOE_FP4=0 (see above).
  • No MTP: Heretic abliteration strips the MTP weights, so do not enable speculative-decoding flags. Distilling an MTP head from the original Qwen model is a planned future improvement.
  • Thinking model: chain-of-thought is very verbose. Rein it in with a system prompt, or set a generous max_tokens.

Quantization Tooling

Calibrated with llm-compressor PR #2383 (Qwen3.5 support, not yet merged). Post-processing was required to make the checkpoint loadable by vLLM: config grafting, tensor merging, and ignore-list prefix correction.

Credits
