Qwen3.5-122B-A10B-heretic-int4-AutoRound

The first INT4 AutoRound quantization of the Heretic (uncensored) Qwen3.5-122B — with vision preserved.

A 4-bit symmetric quantization of trohrbaugh/Qwen3.5-122B-A10B-heretic, generated using Intel AutoRound v0.12.0. The original multimodal architecture (Qwen3_5MoeForConditionalGeneration) is fully preserved — text, vision, video, and reasoning all work.

Base Model: trohrbaugh/Qwen3.5-122B-A10B-heretic (derived from Qwen/Qwen3.5-122B-A10B)
Quantization: INT4 symmetric, group_size 128, AutoRound (sign-gradient descent)
Packing: auto_round:auto_gptq (GPTQ Marlin compatible)
Size on Disk: 63 GB (vs. 234 GB BF16 original; 73% smaller)
Format: 63 safetensors shards
Architecture: 122B total params, 10B active per token (256 experts, 8 routed + 1 shared)
Context: 262,144 tokens natively
Capabilities: Text, Code, Reasoning, Tool Calling, Vision, Video
License: Apache 2.0 (inherited from Qwen)

Model Lineage

This model has a three-step provenance chain:

Qwen/Qwen3.5-122B-A10B                        ← Qwen team's flagship 122B MoE (BF16, 234 GB)
  └─► trohrbaugh/Qwen3.5-122B-A10B-heretic    ← Abliterated with Heretic v1.2.0 (KL=0.0916, 9/100 refusals)
        └─► THIS MODEL                        ← INT4 AutoRound quantization (63 GB, vision preserved)

  1. Qwen3.5-122B-A10B — Qwen team's 122B Mixture-of-Experts model with DeltaNet hybrid attention, 256 experts (8 active + 1 shared per token), native 262K context, and multimodal vision support.

  2. trohrbaugh/Qwen3.5-122B-A10B-heretic — Abliterated (uncensored) variant using Heretic v1.2.0 by p-e-w. Uses parametrized directional ablation with interpolated direction index (41.20), applied to attn.o_proj and mlp.down_proj. Reduces refusals from 99/100 to 9/100 while maintaining low KL divergence (0.0916).

  3. This model — INT4 AutoRound quantization targeting unified-memory GPU systems (NVIDIA DGX Spark / ASUS Ascent GX10). All 333 vision weights, 192 shared expert layers, 48 MoE gate layers, and lm_head preserved at full precision. Text module quantized to INT4 W4G128.


Quantization Details

| Parameter | Value |
|---|---|
| Method | Intel AutoRound v0.12.0 (sign-gradient descent) |
| Bits | 4 (INT4) |
| Group Size | 128 |
| Symmetric | Yes |
| Packing | auto_round:auto_gptq (GPTQ Marlin backend in vLLM) |
| Block Quantized | model.language_model.layers (text module only) |
| Layers Quantized | 37,092 / 37,395 (99.2%) |
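To make the "INT4 symmetric, group_size 128" (W4G128) scheme concrete, here is a minimal sketch of symmetric group quantization in plain Python. This is an illustration only, not AutoRound's actual code; real kernels pack the 4-bit values and AutoRound additionally tunes the rounding decisions:

```python
# Symmetric INT4 group quantization sketch: each group of 128 weights
# shares one scale; values map onto the 16-level grid [-8, 7] with no
# zero-point (symmetric means zero maps to zero).

GROUP = 128

def quantize_group(group):
    scale = max(abs(w) for w in group) / 7 or 1.0  # absmax scaling
    q = [max(-8, min(7, round(w / scale))) for w in group]
    return q, scale

def dequantize_group(q, scale):
    return [qi * scale for qi in q]

# A deterministic toy weight vector spanning two groups of 128
weights = [((i * 37) % 101 - 50) / 50 for i in range(256)]

recon = []
for g in range(0, len(weights), GROUP):
    q, s = quantize_group(weights[g:g + GROUP])
    recon.extend(dequantize_group(q, s))

max_err = max(abs(a - b) for a, b in zip(weights, recon))
print(f"max reconstruction error: {max_err:.4f}")  # bounded by scale / 2
```

Per-group scales are why group_size matters: smaller groups track local weight magnitudes more tightly at the cost of more scale metadata.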

What Was Preserved (NOT Quantized)

Keeping critical layers at full precision is essential for MoE model quality:

  • 333 vision encoder weights (model.visual.blocks.*) — BF16
  • 192 shared expert projections (48 layers × gate_proj, up_proj, down_proj) — FP16
  • 48 shared expert gates (shared_expert_gate) — FP16
  • 48 MoE routing gates — FP16
  • lm_head — original precision

The shared expert is activated for every token (unlike routed experts), so preserving its precision is critical for maintaining output quality. The vision encoder is kept at BF16 to preserve multimodal capabilities.
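The routing described above can be sketched in a few lines (illustrative only, not Qwen's implementation): per token, a gate selects the top 8 of 256 routed experts, while the shared expert runs unconditionally, so its weights sit on every token's forward path.

```python
import random

NUM_EXPERTS, TOP_K = 256, 8
random.seed(0)
shared_calls = 0

def shared_expert(h):
    global shared_calls
    shared_calls += 1                  # fires once per token, every token
    return 0.5 * h                     # stand-in for gate/up/down projections

def routed_expert(idx, h):
    return 0.01 * (idx % 7 - 3) * h    # toy per-expert transform

def moe_layer(h):
    scores = [random.random() for _ in range(NUM_EXPERTS)]
    top8 = sorted(range(NUM_EXPERTS), key=lambda i: -scores[i])[:TOP_K]
    out = sum(routed_expert(i, h) for i in top8)  # only 8 routed experts fire
    return out + shared_expert(h)                 # shared expert always fires

tokens = [1.0, 2.0, 3.0, 4.0]
outputs = [moe_layer(t) for t in tokens]
print(f"shared expert calls: {shared_calls} for {len(tokens)} tokens")
```

Any quantization error in the shared expert therefore accumulates on every token, whereas a given routed expert only affects the minority of tokens routed to it.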

Why AutoRound?

AutoRound uses sign-gradient descent to optimize weight rounding decisions, achieving better accuracy than RTN (round-to-nearest) and competitive results with GPTQ/AWQ while being faster and more robust. The GPTQ-compatible packing format (auto_round:auto_gptq) means this model works with vLLM's optimized Marlin CUDA kernels out of the box.
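To see why optimizing rounding decisions helps, consider this toy in plain Python: RTN picks each weight's rounding in isolation, while choosing rounding directions jointly to minimize the layer's output error can do strictly better. The brute-force search below stands in for AutoRound's sign-gradient descent over learned rounding offsets; it is an illustration of the idea, not the algorithm.

```python
import math
from itertools import product

scale = 0.1
w = [0.26, 0.26, 0.26, 0.26]   # every weight rounds UP under RTN (2.6 -> 3)
x = [1.0, 1.0, 1.0, 1.0]

def output(q):                  # layer output y = q . x
    return sum(qi * xi for qi, xi in zip(q, x))

y_ref = output(w)

# RTN: each weight rounded to its nearest grid point independently
q_rtn = [round(wi / scale) * scale for wi in w]
err_rtn = (output(q_rtn) - y_ref) ** 2

# Jointly optimized rounding: pick floor/ceil per weight to minimize
# the OUTPUT error, letting individual rounding errors cancel
best = min(
    (
        ((output(q) - y_ref) ** 2, q)
        for dirs in product([math.floor, math.ceil], repeat=len(w))
        for q in [[f(wi / scale) * scale for f, wi in zip(dirs, w)]]
    ),
    key=lambda t: t[0],
)

print(f"RTN output error:            {err_rtn:.4f}")
print(f"optimized rounding error:    {best[0]:.4f}")
```

Brute force is exponential in the number of weights; AutoRound reaches a similar joint optimum tractably by gradient-tuning a bounded per-weight rounding offset.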


Model Size Comparison

| Variant | Size on Disk | Reduction |
|---|---|---|
| Qwen3.5-122B-A10B (BF16) | ~234 GB | baseline |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound (canonical) | ~72 GB | 69% |
| This model (Heretic INT4) | 63 GB | 73% |

The 9 GB difference from Intel's canonical INT4 is because the Heretic fork does not include MTP (Multi-Token Prediction) weights that are present in the base Qwen model.


Performance (Measured on DGX Spark Cluster)

Tested on a 2-node NVIDIA DGX Spark cluster (2× Grace Blackwell GB10, 128GB unified memory each) with tensor parallelism over 200Gbps RoCE RDMA:

| Metric | Value |
|---|---|
| Decode Speed | 24–27 tok/s (single user) |
| Context Window | 225,000 tokens (tested) |
| GPU Memory | 50% utilization per node (TP=2) |
| Thinking Mode | Toggleable (/think and /no_think) |
| Tool Calling | Parallel tool calls with correct argument generation |

Head-to-Head: Heretic INT4 vs Cardinal (Canonical) INT4

We ran identical test batteries against both this model and Intel/Qwen3.5-122B-A10B-int4-AutoRound on the same hardware:

| Test | Cardinal (Canonical INT4) | Heretic INT4 (This Model) |
|---|---|---|
| Code Generation (rate limiter module, thinking enabled) | 7,060 tokens, 284 s | 7,150 tokens, 291 s |
| Reasoning (distributed systems architecture) | 4,176 tokens, 24.9 tok/s | 3,495 tokens, 27.5 tok/s |
| Tool Calling (parallel function calls) | 2 calls, correct args | 2 calls, correct args |

Verdict: Quality is essentially identical. The Heretic is slightly faster on reasoning tasks (27.5 vs 24.9 tok/s). Both models produced complete, working code with proper error handling and TypeScript types. The Heretic's code generation included a clock.ts time abstraction — a marginally more mature architectural pattern.


Quick Start

vLLM (Recommended)

pip install "vllm>=0.17.0"

vllm serve happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound \
  --served-model-name qwen3.5-122b \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --trust-remote-code

Then query with any OpenAI-compatible client:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [{"role": "user", "content": "Write a Python quicksort implementation"}],
    "temperature": 0.6
  }'
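The /think and /no_think soft switches listed under Performance go through this same endpoint. A small sketch (the `chat_payload` helper is hypothetical; it only builds the request body, which you would POST to /v1/chat/completions with any HTTP client or the OpenAI SDK):

```python
import json

def chat_payload(prompt: str, thinking: bool = True) -> str:
    # Per the model card, appending "/no_think" to the user turn
    # suppresses the reasoning trace for that turn; "/think" re-enables it.
    content = prompt if thinking else f"{prompt} /no_think"
    return json.dumps({
        "model": "qwen3.5-122b",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.6,
    })

print(chat_payload("Summarize the tradeoffs of INT4 quantization.",
                   thinking=False))
```

With --reasoning-parser qwen3 enabled as in the serve command above, thinking-mode output arrives in a separate reasoning field rather than mixed into the reply.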

HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

DGX Spark / ASUS Ascent GX10 Deployment Guide

This model was specifically tested on NVIDIA DGX Spark (also sold as ASUS Ascent GX10) — a desktop workstation with a Grace Blackwell GB10 SoC and 128 GB unified CPU+GPU memory. Here's how to run it:

Single Node (128 GB Unified Memory)

# Conservative settings — leaves room for ~50GB KV cache
vllm serve /path/to/heretic-int4-autoround-v2 \
  --served-model-name qwen3.5-122b \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 131072 \
  --enforce-eager \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --trust-remote-code

Warning: Do NOT set --gpu-memory-utilization above 0.85 on unified memory systems. Higher values can cause system freezes. Start at 0.75 and increase carefully.

Dual Node (2× DGX Spark, 256 GB Total)

For two DGX Sparks connected via 200Gbps RoCE RDMA (the setup we tested on):

# Node 2 (worker) — start first
docker run -d --name vllm-worker \
  --gpus all --network=host --ipc=host --privileged \
  --shm-size 10.24g \
  -v /path/to/models:/models \
  -e NCCL_SOCKET_IFNAME=enP2p1s0f0np0 \
  -e NCCL_IB_HCA=roceP2p1s0f0 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600 \
  vllm/vllm-openai:latest \
  bash -c "ray start --address=<NODE1_IP>:6399 --num-gpus=1 --node-ip-address=<NODE2_IP> --block"

# Node 1 (head) — start second
docker run -d --name vllm-head \
  --gpus all --network=host --ipc=host --privileged \
  --shm-size 10.24g \
  -v /path/to/models:/models \
  -e NCCL_SOCKET_IFNAME=enP2p1s0f0np0 \
  -e NCCL_IB_HCA=roceP2p1s0f0 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600 \
  vllm/vllm-openai:latest \
  bash -c "ray start --head --port=6399 --num-gpus=1 --node-ip-address=<NODE1_IP> && \
  vllm serve /models/heretic-int4-autoround-v2 \
    --served-model-name qwen3.5-122b \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --max-model-len 225000 \
    --gpu-memory-utilization 0.50 \
    --enforce-eager \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --trust-remote-code"

DGX Spark Tips & Gotchas

  • --shm-size 10.24g is critical for multi-node Ray — without it, model loading can hang at ~57% (NCCL deadlock)
  • VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600 prevents non-streaming requests from timing out at the default 300s for TP>1 configurations
  • --enforce-eager is recommended: CUDA graph capture is slow on GB10, and at these model sizes the throughput gain from graphs doesn't justify the extra compile time and memory they consume
  • NCCL_P2P_DISABLE=1 is required for GB10's unified memory architecture — P2P transfers hang
  • Set vm.swappiness=1 at the OS level (sysctl vm.swappiness=1) to prevent the kernel from swapping GPU-mapped pages
  • The 200Gbps DAC cable (RoCE RDMA) between nodes provides enough bandwidth for TP=2 tensor parallelism with minimal overhead

Reproduce the Quantization

Requirements

  • GPU: 2× NVIDIA H200 SXM (or similar — needs ~80GB combined VRAM + 450GB RAM)
  • Time: ~4 hours 12 minutes
  • Cost: ~$33.50 on RunPod ($7.18/hr for 2× H200 SXM pod)

Setup

# Create a RunPod instance with 2x H200 SXM, 2TB RAM
# SSH in and install dependencies

pip install "torch>=2.5.0"
pip install git+https://github.com/intel/auto-round.git    # v0.12.0+ (PyPI is too old)
pip install git+https://github.com/huggingface/transformers.git  # v5.3.0+

# Download the Heretic base model (~234GB, takes a while)
huggingface-cli download trohrbaugh/Qwen3.5-122B-A10B-heretic \
  --local-dir /workspace/heretic-bf16

Quantize

auto-round \
  --model_name "/workspace/heretic-bf16" \
  --output_dir "/workspace/output/heretic-int4" \
  --ignore_layers shared_expert \
  --device_map "auto"

That's it. AutoRound handles everything else — calibration data, iteration count, group size, and format are all set to sensible defaults.

What Happens During Quantization

  • AutoRound processes each transformer layer sequentially (~5.3 minutes per layer, 48 layers)
  • For each layer, it uses sign-gradient descent to optimize rounding decisions (INT4 W4G128)
  • Shared expert layers are automatically skipped (kept at FP16) due to --ignore_layers shared_expert
  • Vision encoder weights are preserved at BF16 (they live outside model.language_model.layers)
  • Peak memory: 453.8 GB RAM, 35-45 GB VRAM per GPU
  • Each layer shows 50-60% loss reduction from the optimization
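The per-layer rate above can be cross-checked against the ~4 hour 12 minute wall time reported under Requirements:

```python
# 48 transformer layers at ~5.3 minutes per layer
minutes = 48 * 5.3
print(f"{minutes:.0f} min ≈ {minutes // 60:.0f} h {minutes % 60:.0f} min")
```

This lands within a couple of minutes of the reported total, so layer optimization dominates the run; loading, calibration, and packing are comparatively cheap.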

Learnings from the Process

  1. Use auto-round from git, not PyPI. The PyPI release (0.10.2) doesn't support Qwen3.5 MoE models. You need the merged PR #1476 which adds Qwen3_5MoeForConditionalGeneration support.

  2. Transformers must also be from git. The released version doesn't have the Qwen3.5 model class yet.

  3. PyTorch 2.4 will fail with AttributeError: 'Linear' object has no attribute 'set_submodule'. Use PyTorch 2.5+.

  4. --ignore_layers shared_expert is important. The shared expert is activated for every single token (unlike the 255 routed experts where only 8 fire per token). Quantizing it hurts output quality disproportionately. Intel uses the same flag for the canonical model.

  5. --device_map "auto" distributes across GPUs correctly. With 2× H200, the model splits cleanly and quantization runs in parallel where possible.

  6. The Heretic fork has no MTP weights. This is why our output is 63GB instead of the 72GB you'd get from quantizing the canonical Qwen model. Not a quality issue — MTP (Multi-Token Prediction) is a speculative decoding feature, not a capability.

  7. Vision is automatically preserved. Because we set block_name_to_quantize to model.language_model.layers, the vision encoder (model.visual.*) is never touched. You get full multimodal support in the quantized model.


Abliteration Notice

This model is based on the "Heretic" variant of Qwen3.5-122B-A10B, which has had safety refusals significantly reduced using directional ablation (Heretic v1.2.0 by p-e-w).

  • The base model refused 99/100 test prompts. The Heretic variant refuses 9/100.
  • KL divergence from the original: 0.0916 (low — general capabilities well preserved).
  • Abliteration method: parametrized directional ablation with interpolated direction index, applied to attn.o_proj and mlp.down_proj across all 48 layers.

This model will follow instructions that the original Qwen model would refuse. Please use responsibly. The creators are not responsible for misuse.


Ethical Considerations and Limitations

  • This is an uncensored model. It has reduced safety guardrails compared to the original Qwen3.5-122B-A10B.
  • Quantization to INT4 introduces a small quality degradation compared to BF16, though our testing shows negligible impact on code generation, reasoning, and tool calling tasks.
  • The model may generate biased, incorrect, or harmful content. Users should implement appropriate safety measures for their applications.
  • This model should not be used to generate content that could cause harm to individuals or groups.

License

This model inherits the Apache 2.0 license from Qwen/Qwen3.5-122B-A10B.


Acknowledgments

  • Qwen Team for the incredible Qwen3.5-122B-A10B base model
  • trohrbaugh for the Heretic abliteration
  • p-e-w for the Heretic abliteration tool
  • Intel for AutoRound and for demonstrating the --ignore_layers shared_expert recipe on the canonical model
  • RunPod for affordable H200 GPU access
  • NVIDIA for the DGX Spark platform

Citation

If you use this model, please cite both the AutoRound quantization method and the original Qwen model:

@article{cheng2024optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2024}
}

@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2025},
  url={https://qwenlm.github.io/blog/qwen3.5/}
}

Quantized on 2026-03-15 using RunPod 2× H200 SXM. Tested on a 2-node DGX Spark cluster with 200Gbps RoCE RDMA.
