Qwen3.5-122B-A10B GPTQ 4-bit

GPTQ 4-bit quantization of Qwen/Qwen3.5-122B-A10B, a 122B-parameter Mixture-of-Experts (MoE) multimodal model with ~10B activated parameters per token.

Includes full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.

Model Overview

  • Architecture: Qwen3_5MoeForConditionalGeneration (multimodal: text + vision)
  • Total parameters: ~122B
  • Activated parameters: ~10B per token (8 of 256 experts selected per token)
  • Layers: 48 (36 linear attention + 12 full attention, repeating 3:1 pattern)
  • Experts: 256 per layer + 1 shared expert per layer
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT (1152 hidden, 16x16 patches), BF16
  • MTP module: 1-layer speculative decoding head, BF16

Quantization Details

All 36,864 MoE expert modules (256 experts x 3 projections x 48 layers) are quantized to INT4 using GPTQ. Non-expert modules (including the full vision encoder and MTP module) remain at BF16/FP16 to preserve quality.
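The module count and approximate effective bit-width can be sanity-checked with a few lines (the 0.5 extra bits per weight for FP16 group scales is an approximation that ignores packed zero-points and other per-module metadata):

```python
# Sanity-check the quantized expert module count and effective bit-width.
experts_per_layer = 256
projections = 3            # gate_proj, up_proj, down_proj
layers = 48

modules = experts_per_layer * projections * layers
print(modules)             # 36864 quantized expert modules

# Approximate effective bits/weight for 4-bit GPTQ at group size 32:
# 4 weight bits plus one FP16 scale (16 bits) shared by 32 weights.
group_size = 32
effective_bits = 4 + 16 / group_size
print(effective_bits)      # 4.5
```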

| Component | Precision | Notes |
|---|---|---|
| MoE experts (gate_proj, up_proj, down_proj) | INT4 (GPTQ) | 36,864 modules quantized |
| Full attention (q_proj, k_proj, v_proj, o_proj) | FP16 | Every 4th layer (12 layers) |
| Linear attention (in_proj_qkv, in_proj_z, out_proj) | FP16 | Full precision (36 layers) |
| Shared experts | FP16 | Full precision |
| Vision encoder (model.visual.*) | BF16 | Full precision |
| MTP module (mtp.*) | BF16 | Full precision |
| Embeddings, LM head, norms | FP16 | Full precision |

GPTQ configuration:

  • Bits: 4
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
  • Failsafe: RTN fallback for poorly calibrated rare experts (1,452 of 36,864 modules, ~3.9%)
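These settings correspond roughly to a GPTQModel quantization recipe like the sketch below. This is illustrative only: the field names follow GPTQModel's `QuantizeConfig`, but exact names and the `calibration_dataset` preparation should be checked against the GPTQModel v5.7.1 API.

```python
# Hedged sketch of the quantization recipe, not the exact script used.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # INT4 weights
    group_size=32,   # one scale per 32 weights
    sym=True,        # symmetric quantization
    desc_act=False,  # no activation-order reordering
)

model = GPTQModel.load("Qwen/Qwen3.5-122B-A10B", quant_config)
model.quantize(calibration_dataset)  # hypothetical: 2,048 mixed code/text samples
model.save("Qwen3.5-122B-A10B-GPTQ-4bit")
```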

Calibration

  • Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
  • Samples: 2,048
  • Quantizer: GPTQModel v5.7.1
  • Quantization time: ~21 hours on 4x AMD MI100 (32GB)

Model Size

| Version | Size | Compression |
|---|---|---|
| BF16 (original) | 234 GB | - |
| GPTQ 4-bit | 80 GB | 2.9x |
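The compression figure follows directly from the stated sizes:

```python
# Verify the stated ~2.9x compression ratio.
bf16_gb, gptq_gb = 234, 80
ratio = round(bf16_gb / gptq_gb, 1)
print(ratio)  # 2.9
```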

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, non-overlapping chunks:

| Model | Perplexity | Method |
|---|---|---|
| BF16 (original) | 4.8366 | llama-perplexity (GGUF BF16, ctx=2048) |
| GPTQ 4-bit | 5.1206 | vLLM API logprobs (stride=2048) |

Note on comparability: These measurements use different inference backends (llama.cpp vs vLLM) because the full BF16 model (234 GB) does not fit in vLLM on this system. Differences in numerical precision, tokenization, and logprob collection method mean the values are not directly comparable.
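For the vLLM-side number, the reduction from collected token logprobs to perplexity is simply the exponential of the negative mean log-probability. A minimal sketch of that final step (the actual evaluation fed non-overlapping 2048-token chunks through the API first):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Toy example: three tokens with made-up log-probabilities.
print(round(perplexity([-1.2, -0.8, -1.6]), 4))  # 3.3201
```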

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.5-122B-A10B-GPTQ-4bit \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
|---|---|
| --gpu-memory-utilization 0.95 | Use 95% of GPU VRAM for KV cache + weights |
| --max-model-len 32768 | 32K context (increase if GPU memory allows) |
| --tensor-parallel-size 4 | Shard across 4 GPUs (adjust to your setup) |
| --reasoning-parser qwen3 | Enable thinking/reasoning token parsing |
| --enable-auto-tool-choice --tool-call-parser qwen3_coder | Enable tool/function calling |
| --dtype float16 | Run in FP16 (required for ROCm GPTQ kernels) |
| --skip-mm-profiling | Skip multimodal memory profiling at startup |
| --limit-mm-per-prompt '{"image": 2}' | Allow up to 2 images per request |

Context length note: The quantized weights total ~80 GB, leaving limited room for KV cache on 4x 32 GB GPUs. Start with --max-model-len 32768 and increase if memory permits. For the full 256K context, use 8+ GPUs or GPUs with more VRAM.
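The headroom estimate behind that recommendation is simple arithmetic (a rough sketch; the actual KV cache budget also depends on vLLM's activation memory and CUDA-graph overheads):

```python
# Rough KV cache headroom on the 4x 32 GB setup described above.
gpus = 4
vram_per_gpu_gb = 32
weights_gb = 80

total_vram = gpus * vram_per_gpu_gb   # 128 GB
usable = total_vram * 0.95            # --gpu-memory-utilization 0.95
kv_budget = usable - weights_gb
print(round(kv_budget, 1))            # ~41.6 GB for KV cache + overhead
```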

vLLM bug workaround: vLLM versions up to at least 0.15.2 have a bug in Qwen3_5MoeTextConfig where ignore_keys_at_rope_validation is defined as a list instead of a set, causing a TypeError during config parsing. Apply this fix before serving:

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

Vision Example (via OpenAI API)

import base64, requests

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.5-122B-A10B-GPTQ-4bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's Qwen3_5MoeGPTQ class expects the text-only weight prefix (model.layers.*) and does not support the multimodal architecture (model.language_model.layers.*). The transformers GPTQ path delegates to optimum, which does not handle the fused-expert architecture. Use vLLM for inference.

Technical Notes

Qwen3.5-122B-A10B stores MoE expert weights as fused 3D nn.Parameter tensors rather than individual nn.Linear modules. During quantization, GPTQModel's MODULE_CONVERTER_MAP converts these to individual quantizable nn.Linear layers. This same conversion must also run during model loading for the quantized kernels to be applied correctly.
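A toy illustration of that conversion, using NumPy arrays in place of real checkpoint tensors (the dimensions here are made up; the real model has 256 experts per layer):

```python
import numpy as np

num_experts, out_features, in_features = 4, 6, 8  # toy sizes, not the real model

# Fused storage: one 3D parameter holding every expert's weight matrix.
fused = np.random.randn(num_experts, out_features, in_features)

# Converter behavior, conceptually: slice out one 2D weight per expert
# so each can be quantized (and later dequantized) as an ordinary linear layer.
per_expert = [fused[i] for i in range(num_experts)]

assert len(per_expert) == num_experts
assert per_expert[0].shape == (out_features, in_features)
```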

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text model's MoE expert weights are quantized. vLLM casts BF16 vision tensors to FP16 at load time when using --dtype float16.

Credits

  • Base Model: Qwen - Qwen3.5-122B-A10B
  • Quantization: GPTQ via GPTQModel v5.7.1
  • Expert Converter: convert_qwen3_5_moe_expert_converter for fused 3D expert weights
  • Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
