Qwen3.5-122B-A10B-heretic-int4-AutoRound
The first INT4 AutoRound quantization of the Heretic (uncensored) Qwen3.5-122B — with vision preserved.
A 4-bit symmetric quantization of trohrbaugh/Qwen3.5-122B-A10B-heretic, generated using Intel AutoRound v0.12.0. The original multimodal architecture (Qwen3_5MoeForConditionalGeneration) is fully preserved — text, vision, video, and reasoning all work.
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-122B-A10B → trohrbaugh/Qwen3.5-122B-A10B-heretic |
| Quantization | INT4 symmetric, group_size 128, AutoRound (sign-gradient descent) |
| Packing | auto_round:auto_gptq (GPTQ Marlin compatible) |
| Size on Disk | 63 GB (vs 234 GB BF16 original — 73% smaller) |
| Format | 63 safetensors shards |
| Architecture | 122B total params, 10B active/token (256 experts, 8 routed + 1 shared) |
| Context | 262,144 tokens natively |
| Capabilities | Text, Code, Reasoning, Tool Calling, Vision, Video |
| License | Apache 2.0 (inherited from Qwen) |
Model Lineage
This model has a three-step provenance chain:
```
Qwen/Qwen3.5-122B-A10B                      ← Qwen team's flagship 122B MoE (BF16, 234 GB)
 └─► trohrbaugh/Qwen3.5-122B-A10B-heretic   ← Abliterated with Heretic v1.2.0 (KL=0.0916, 9/100 refusals)
      └─► THIS MODEL                        ← INT4 AutoRound quantization (63 GB, vision preserved)
```
Qwen3.5-122B-A10B — Qwen team's 122B Mixture-of-Experts model with DeltaNet hybrid attention, 256 experts (8 active + 1 shared per token), native 262K context, and multimodal vision support.
trohrbaugh/Qwen3.5-122B-A10B-heretic — Abliterated (uncensored) variant using Heretic v1.2.0 by p-e-w. Uses parametrized directional ablation with interpolated direction index (41.20), applied to `attn.o_proj` and `mlp.down_proj`. Reduces refusals from 99/100 to 9/100 while maintaining low KL divergence (0.0916).

This model — INT4 AutoRound quantization targeting unified-memory GPU systems (NVIDIA DGX Spark / ASUS Ascent GX10). All 333 vision weights, 192 shared expert layers, 48 MoE gate layers, and the `lm_head` are preserved at full precision. The text module is quantized to INT4 W4G128.
Quantization Details
| Parameter | Value |
|---|---|
| Method | Intel AutoRound v0.12.0 (sign-gradient descent) |
| Bits | 4 (INT4) |
| Group Size | 128 |
| Symmetric | Yes |
| Packing | auto_round:auto_gptq (GPTQ Marlin backend in vLLM) |
| Block Quantized | model.language_model.layers (text module only) |
| Layers Quantized | 37,092 / 37,395 (99.2%) |
What Was Preserved (NOT Quantized)
Keeping critical layers at full precision is essential for MoE model quality:
- 333 vision encoder weights (`model.visual.blocks.*`) — BF16
- 192 shared expert projections (48 layers × `gate_proj`, `up_proj`, `down_proj`) — FP16
- 48 shared expert gates (`shared_expert_gate`) — FP16
- 48 MoE routing gates — FP16
- `lm_head` — original precision
The shared expert is activated for every token (unlike routed experts), so preserving its precision is critical for maintaining output quality. The vision encoder is kept at BF16 to preserve multimodal capabilities.
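As an illustration, the preservation rules above can be expressed as a small name filter. This is a hypothetical helper — the patterns are inferred from the layer names listed in this card, not taken from AutoRound's actual skip logic:

```python
import re

# Patterns inferred from the preserved-layer list above (hypothetical,
# not AutoRound's real implementation). Anything matching stays full precision.
PRESERVED_PATTERNS = (
    r"^model\.visual\.",        # vision encoder blocks (BF16)
    r"\.shared_expert\.",       # shared expert gate/up/down projections (FP16)
    r"\.shared_expert_gate",    # per-layer shared expert gates (FP16)
    r"\.mlp\.gate\.",           # MoE routing gates (FP16)
    r"^lm_head",                # output head (original precision)
)

def is_int4_quantized(name: str) -> bool:
    """True if a weight with this name would be quantized to INT4."""
    return not any(re.search(p, name) for p in PRESERVED_PATTERNS)

# Routed experts are quantized; vision and shared experts are not.
assert is_int4_quantized("model.language_model.layers.0.mlp.experts.7.down_proj.weight")
assert not is_int4_quantized("model.visual.blocks.3.attn.qkv.weight")
assert not is_int4_quantized("model.language_model.layers.0.mlp.shared_expert.up_proj.weight")
```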
Why AutoRound?
AutoRound uses sign-gradient descent to optimize weight rounding decisions, achieving better accuracy than RTN (round-to-nearest) and competitive results with GPTQ/AWQ while being faster and more robust. The GPTQ-compatible packing format (auto_round:auto_gptq) means this model works with vLLM's optimized Marlin CUDA kernels out of the box.
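A toy sketch of why learned rounding beats RTN: per-weight round-to-nearest minimizes weight error, but not the layer's *output* error on calibration data. The brute-force search below stands in for AutoRound's sign-gradient descent (all names and numbers are illustrative, not from the AutoRound codebase):

```python
import itertools
import math
import random

random.seed(0)
n = 6  # tiny weight vector so we can enumerate every rounding choice
w = [random.uniform(-1, 1) for _ in range(n)]
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(16)]  # calibration activations
scale = max(abs(v) for v in w) / 7  # symmetric INT4: codes in [-8, 7]

def clamp(q):
    return max(-8, min(7, q))

def output_mse(q):
    """Error of the layer output X·w vs X·(q·scale) — not the weights themselves."""
    total = 0.0
    for row in X:
        d = sum(r * (wi - qi * scale) for r, wi, qi in zip(row, w, q))
        total += d * d
    return total / len(X)

rtn = [clamp(round(v / scale)) for v in w]  # round-to-nearest baseline

# Enumerate every floor/ceil decision per weight. At full scale, AutoRound
# learns these decisions with sign-gradient descent instead of brute force.
best = min(
    ([clamp([math.floor, math.ceil][b](v / scale)) for b, v in zip(bits, w)]
     for bits in itertools.product([0, 1], repeat=n)),
    key=output_mse,
)
assert output_mse(best) <= output_mse(rtn)  # RTN is one of the 2^n candidates
print(f"output MSE: RTN {output_mse(rtn):.5f} -> optimized {output_mse(best):.5f}")
```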
Model Size Comparison
| Variant | Size on Disk | Reduction |
|---|---|---|
| Qwen3.5-122B-A10B (BF16) | ~234 GB | — |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound (canonical) | ~72 GB | 69% |
| This model (Heretic INT4) | 63 GB | 73% |
The 9 GB difference from Intel's canonical INT4 is because the Heretic fork does not include MTP (Multi-Token Prediction) weights that are present in the base Qwen model.
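The reduction figures in the table are straightforward to verify:

```python
bf16_gb = 234
for name, gb in [("canonical INT4", 72), ("Heretic INT4 (this model)", 63)]:
    reduction = 100 * (1 - gb / bf16_gb)
    print(f"{name}: {gb} GB, {reduction:.0f}% smaller than BF16")
```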
Performance (Measured on DGX Spark Cluster)
Tested on a 2-node NVIDIA DGX Spark cluster (2× Grace Blackwell GB10, 128GB unified memory each) with tensor parallelism over 200Gbps RoCE RDMA:
| Metric | Value |
|---|---|
| Decode Speed | 24–27 tok/s (single user) |
| Context Window | 225,000 tokens (tested) |
| GPU Memory | 50% utilization per node (TP=2) |
| Thinking Mode | Toggleable (/think and /no_think) |
| Tool Calling | Parallel tool calls with correct argument generation |
Head-to-Head: Heretic INT4 vs Cardinal (Canonical) INT4
We ran identical test batteries against both this model and Intel/Qwen3.5-122B-A10B-int4-AutoRound on the same hardware:
| Test | Cardinal (Canonical INT4) | Heretic INT4 (This Model) |
|---|---|---|
| Code Generation (rate limiter module, thinking enabled) | 7,060 tokens, 284s | 7,150 tokens, 291s |
| Reasoning (distributed systems architecture) | 4,176 tokens, 24.9 tok/s | 3,495 tokens, 27.5 tok/s |
| Tool Calling (parallel function calls) | 2 calls, correct args | 2 calls, correct args |
Verdict: Quality is essentially identical. The Heretic is slightly faster on reasoning tasks (27.5 vs 24.9 tok/s). Both models produced complete, working code with proper error handling and TypeScript types. The Heretic's code generation included a clock.ts time abstraction — a marginally more mature architectural pattern.
Quick Start
vLLM (Recommended)
```bash
pip install "vllm>=0.17.0"

vllm serve happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound \
  --served-model-name qwen3.5-122b \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --trust-remote-code
```
Then query with any OpenAI-compatible client:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [{"role": "user", "content": "Write a Python quicksort implementation"}],
    "temperature": 0.6
  }'
```
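The same request can be made from Python using only the standard library. A minimal sketch — the model name and port match the `vllm serve` flags above; adjust `base_url` for your deployment:

```python
import json
import urllib.request

def build_payload(prompt: str, temperature: float = 0.6) -> dict:
    """Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": "qwen3.5-122b",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Write a Python quicksort implementation")  # requires the server above to be running
```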
HuggingFace Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
DGX Spark / ASUS Ascent GX10 Deployment Guide
This model was specifically tested on NVIDIA DGX Spark (also sold as ASUS Ascent GX10) — a desktop workstation with a Grace Blackwell GB10 SoC and 128 GB unified CPU+GPU memory. Here's how to run it:
Single Node (128 GB Unified Memory)
```bash
# Conservative settings — leaves room for ~50 GB KV cache
vllm serve /path/to/heretic-int4-autoround-v2 \
  --served-model-name qwen3.5-122b \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 131072 \
  --enforce-eager \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
Warning: Do NOT set `--gpu-memory-utilization` above 0.85 on unified memory systems. Higher values can cause system freezes. Start at 0.75 and increase carefully.
Dual Node (2× DGX Spark, 256 GB Total)
For two DGX Sparks connected via 200Gbps RoCE RDMA (the setup we tested on):
```bash
# Node 2 (worker) — start first
docker run -d --name vllm-worker \
  --gpus all --network=host --ipc=host --privileged \
  --shm-size 10.24g \
  -v /path/to/models:/models \
  -e NCCL_SOCKET_IFNAME=enP2p1s0f0np0 \
  -e NCCL_IB_HCA=roceP2p1s0f0 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600 \
  vllm/vllm-openai:latest \
  bash -c "ray start --address=<NODE1_IP>:6399 --num-gpus=1 --node-ip-address=<NODE2_IP> --block"

# Node 1 (head) — start second
docker run -d --name vllm-head \
  --gpus all --network=host --ipc=host --privileged \
  --shm-size 10.24g \
  -v /path/to/models:/models \
  -e NCCL_SOCKET_IFNAME=enP2p1s0f0np0 \
  -e NCCL_IB_HCA=roceP2p1s0f0 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600 \
  vllm/vllm-openai:latest \
  bash -c "ray start --head --port=6399 --num-gpus=1 --node-ip-address=<NODE1_IP> && \
    vllm serve /models/heretic-int4-autoround-v2 \
      --served-model-name qwen3.5-122b \
      --host 0.0.0.0 --port 8000 \
      --tensor-parallel-size 2 \
      --distributed-executor-backend ray \
      --max-model-len 225000 \
      --gpu-memory-utilization 0.50 \
      --enforce-eager \
      --enable-prefix-caching \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_xml \
      --trust-remote-code"
```
DGX Spark Tips & Gotchas
- `--shm-size 10.24g` is critical for multi-node Ray — without it, model loading can hang at ~57% (NCCL deadlock)
- `VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600` prevents non-streaming requests from timing out at the default 300s for TP>1 configurations
- `--enforce-eager` is recommended — CUDA graph compilation is slow on GB10 and the memory savings aren't worth it at these model sizes
- `NCCL_P2P_DISABLE=1` is required for GB10's unified memory architecture — P2P transfers hang
- Set `vm.swappiness=1` at the OS level (`sysctl vm.swappiness=1`) to prevent the kernel from swapping GPU-mapped pages
- The 200Gbps DAC cable (RoCE RDMA) between nodes provides enough bandwidth for TP=2 tensor parallelism with minimal overhead
Reproduce the Quantization
Requirements
- GPU: 2× NVIDIA H200 SXM (or similar — needs ~80GB combined VRAM + 450GB RAM)
- Time: ~4 hours 12 minutes
- Cost: ~$33.50 on RunPod ($7.18/hr for 2× H200 SXM pod)
Setup
```bash
# Create a RunPod instance with 2x H200 SXM, 2TB RAM
# SSH in and install dependencies
pip install "torch>=2.5.0"
pip install git+https://github.com/intel/auto-round.git          # v0.12.0+ (PyPI is too old)
pip install git+https://github.com/huggingface/transformers.git  # v5.3.0+

# Download the Heretic base model (~234GB, takes a while)
huggingface-cli download trohrbaugh/Qwen3.5-122B-A10B-heretic \
  --local-dir /workspace/heretic-bf16
```
Quantize
```bash
auto-round \
  --model_name "/workspace/heretic-bf16" \
  --output_dir "/workspace/output/heretic-int4" \
  --ignore_layers shared_expert \
  --device_map "auto"
```
That's it. AutoRound handles everything else — calibration data, iteration count, group size, and format are all set to sensible defaults.
What Happens During Quantization
- AutoRound processes each transformer layer sequentially (~5.3 minutes per layer, 48 layers)
- For each layer, it uses sign-gradient descent to optimize rounding decisions (INT4 W4G128)
- Shared expert layers are automatically skipped (kept at FP16) due to `--ignore_layers shared_expert`
- Vision encoder weights are preserved at BF16 (they live outside `model.language_model.layers`)
- Peak memory: 453.8 GB RAM, 35–45 GB VRAM per GPU
- Each layer shows 50–60% loss reduction from the optimization
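A back-of-envelope check on the output size: INT4 W4G128 costs 4 bits per weight plus one FP16 scale per 128-weight group. Treating all 122B parameters as quantized (ignoring the preserved full-precision layers and packing overhead) lands close to the observed shard total:

```python
params = 122e9                    # total parameter count (99.2% of layers quantized)
bits_per_weight = 4 + 16 / 128    # INT4 payload + one FP16 scale per group of 128
approx_gb = params * bits_per_weight / 8 / 1e9
print(f"~{approx_gb:.0f} GB")     # close to the 63 GB observed on disk
```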
Learnings from the Process
- **Use `auto-round` from git, not PyPI.** The PyPI release (0.10.2) doesn't support Qwen3.5 MoE models. You need the merged PR #1476, which adds `Qwen3_5MoeForConditionalGeneration` support.
- **Transformers must also be from git.** The released version doesn't have the Qwen3.5 model class yet.
- **PyTorch 2.4 will fail** with `AttributeError: 'Linear' object has no attribute 'set_submodule'`. Use PyTorch 2.5+.
- **`--ignore_layers shared_expert` is important.** The shared expert is activated for every single token (unlike the 255 routed experts, where only 8 fire per token). Quantizing it hurts output quality disproportionately. Intel uses the same flag for the canonical model.
- **`--device_map "auto"` distributes across GPUs correctly.** With 2× H200, the model splits cleanly and quantization runs in parallel where possible.
- **The Heretic fork has no MTP weights.** This is why our output is 63GB instead of the 72GB you'd get from quantizing the canonical Qwen model. Not a quality issue — MTP (Multi-Token Prediction) is a speculative decoding feature, not a capability.
- **Vision is automatically preserved.** Because we set `block_name_to_quantize` to `model.language_model.layers`, the vision encoder (`model.visual.*`) is never touched. You get full multimodal support in the quantized model.
Abliteration Notice
This model is based on the "Heretic" variant of Qwen3.5-122B-A10B, which has had safety refusals significantly reduced using directional ablation (Heretic v1.2.0 by p-e-w).
- The base model refused 99/100 test prompts. The Heretic variant refuses 9/100.
- KL divergence from the original: 0.0916 (low — general capabilities well preserved).
- Abliteration method: parametrized directional ablation with interpolated direction index, applied to `attn.o_proj` and `mlp.down_proj` across all 48 layers.
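For intuition on the KL figure: it measures how far the abliterated model's next-token distribution drifts from the original's, with 0 meaning identical behavior. A minimal sketch with made-up token probabilities (not actual model outputs):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i); 0 means identical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.20, 0.10]  # hypothetical next-token distribution, original model
q = [0.65, 0.23, 0.12]  # same tokens after abliteration
drift = kl_divergence(p, q)
print(f"{drift:.4f}")   # a small value indicates the two models stay close
```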
This model will follow instructions that the original Qwen model would refuse. Please use responsibly. The creators are not responsible for misuse.
Ethical Considerations and Limitations
- This is an uncensored model. It has reduced safety guardrails compared to the original Qwen3.5-122B-A10B.
- Quantization to INT4 introduces a small quality degradation compared to BF16, though our testing shows negligible impact on code generation, reasoning, and tool calling tasks.
- The model may generate biased, incorrect, or harmful content. Users should implement appropriate safety measures for their applications.
- This model should not be used to generate content that could cause harm to individuals or groups.
License
This model inherits the Apache 2.0 license from Qwen/Qwen3.5-122B-A10B.
Acknowledgments
- Qwen Team for the incredible Qwen3.5-122B-A10B base model
- trohrbaugh for the Heretic abliteration
- p-e-w for the Heretic abliteration tool
- Intel for AutoRound and for demonstrating the `--ignore_layers shared_expert` recipe on the canonical model
- RunPod for affordable H200 GPU access
- NVIDIA for the DGX Spark platform
Citation
If you use this model, please cite both the AutoRound quantization method and the original Qwen model:
```bibtex
@article{cheng2024optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2024}
}

@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2025},
  url={https://qwenlm.github.io/blog/qwen3.5/}
}
```
Quantized on 2026-03-15 using RunPod 2× H200 SXM. Tested on a 2-node DGX Spark cluster with 200Gbps RoCE RDMA.