Qwen3.5-122B-A10B GPTQ 4-bit

GPTQ 4-bit quantization of Qwen/Qwen3.5-122B-A10B, a 122B-parameter Mixture-of-Experts (MoE) multimodal model with ~10B activated parameters per token.

Includes full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.

Model Overview

  • Architecture: Qwen3_5MoeForConditionalGeneration (multimodal: text + vision)
  • Total parameters: ~122B
  • Activated parameters: ~10B per token (8 of 256 experts selected per token)
  • Layers: 48 (36 linear attention + 12 full attention, repeating 3:1 pattern)
  • Experts: 256 per layer + 1 shared expert per layer
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT (1152 hidden, 16x16 patches), BF16
  • MTP module: 1-layer speculative decoding head, BF16

Quantization Details

All 36,864 MoE expert modules (256 experts x 3 projections x 48 layers) are quantized to INT4 using GPTQ. Non-expert modules (including the full vision encoder and MTP module) remain at BF16/FP16 to preserve quality.
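The module count and approximate effective bit-width can be sanity-checked with a few lines (the 0.5 extra bits per weight for FP16 group scales is an approximation that ignores packed zero-points and other per-module metadata):

```python
# Sanity-check the quantized expert module count and effective bit-width.
experts_per_layer = 256
projections = 3            # gate_proj, up_proj, down_proj
layers = 48

modules = experts_per_layer * projections * layers
print(modules)             # 36864 quantized expert modules

# Approximate effective bits/weight for 4-bit GPTQ at group size 32:
# 4 weight bits plus one FP16 scale (16 bits) shared by 32 weights.
group_size = 32
effective_bits = 4 + 16 / group_size
print(effective_bits)      # 4.5
```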

| Component | Precision | Notes |
|---|---|---|
| MoE experts (gate_proj, up_proj, down_proj) | INT4 (GPTQ) | 36,864 modules quantized |
| Full attention (q_proj, k_proj, v_proj, o_proj) | FP16 | Every 4th layer (12 layers) |
| Linear attention (in_proj_qkv, in_proj_z, out_proj) | FP16 | Full precision (36 layers) |
| Shared experts | FP16 | Full precision |
| Vision encoder (model.visual.*) | BF16 | Full precision |
| MTP module (mtp.*) | BF16 | Full precision |
| Embeddings, LM head, norms | FP16 | Full precision |

GPTQ configuration:

  • Bits: 4
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
  • Failsafe: RTN fallback for poorly calibrated rare experts (1,452 of 36,864 modules, ~3.9%)
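These settings correspond roughly to a GPTQModel quantization recipe like the sketch below. This is illustrative only: the field names follow GPTQModel's `QuantizeConfig`, but exact names and the `calibration_dataset` preparation should be checked against the GPTQModel v5.7.1 API.

```python
# Hedged sketch of the quantization recipe, not the exact script used.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # INT4 weights
    group_size=32,   # one scale per 32 weights
    sym=True,        # symmetric quantization
    desc_act=False,  # no activation-order reordering
)

model = GPTQModel.load("Qwen/Qwen3.5-122B-A10B", quant_config)
model.quantize(calibration_dataset)  # hypothetical: 2,048 mixed code/text samples
model.save("Qwen3.5-122B-A10B-GPTQ-4bit")
```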

Calibration

  • Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
  • Samples: 2,048
  • Quantizer: GPTQModel v5.7.1
  • Quantization time: ~21 hours on 4x AMD MI100 (32GB)

Model Size

| Version | Size | Compression |
|---|---|---|
| BF16 (original) | 234 GB | - |
| GPTQ 4-bit | 80 GB | 2.9x |
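The compression figure follows directly from the stated sizes:

```python
# Verify the stated ~2.9x compression ratio.
bf16_gb, gptq_gb = 234, 80
ratio = round(bf16_gb / gptq_gb, 1)
print(ratio)  # 2.9
```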

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, non-overlapping chunks:

| Model | Perplexity | Method |
|---|---|---|
| BF16 (original) | 4.8366 | llama-perplexity (GGUF BF16, ctx=2048) |
| GPTQ 4-bit | 5.1206 | vLLM API logprobs (stride=2048) |

Note on comparability: These measurements use different inference backends (llama.cpp vs vLLM) because the full BF16 model (234 GB) does not fit in vLLM on this system. Differences in numerical precision, tokenization, and logprob collection method mean the values are not directly comparable.
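For the vLLM-side number, the reduction from collected token logprobs to perplexity is simply the exponential of the negative mean log-probability. A minimal sketch of that final step (the actual evaluation fed non-overlapping 2048-token chunks through the API first):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Toy example: three tokens with made-up log-probabilities.
print(round(perplexity([-1.2, -0.8, -1.6]), 4))  # 3.3201
```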

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.5-122B-A10B-GPTQ-4bit \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
|---|---|
| --gpu-memory-utilization 0.95 | Use 95% of GPU VRAM for KV cache + weights |
| --max-model-len 32768 | 32K context (increase if GPU memory allows) |
| --tensor-parallel-size 4 | Shard across 4 GPUs (adjust to your setup) |
| --reasoning-parser qwen3 | Enable thinking/reasoning token parsing |
| --enable-auto-tool-choice --tool-call-parser qwen3_coder | Enable tool/function calling |
| --dtype float16 | Run in FP16 (required for ROCm GPTQ kernels) |
| --skip-mm-profiling | Skip multimodal memory profiling at startup |
| --limit-mm-per-prompt '{"image": 2}' | Allow up to 2 images per request |

Context length note: The quantized weights total ~80 GB, leaving limited room for KV cache on 4x 32 GB GPUs. Start with --max-model-len 32768 and increase if memory permits. For the full 256K context, use 8+ GPUs or GPUs with more VRAM.
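The headroom estimate behind that recommendation is simple arithmetic (a rough sketch; the actual KV cache budget also depends on vLLM's activation memory and CUDA-graph overheads):

```python
# Rough KV cache headroom on the 4x 32 GB setup described above.
gpus = 4
vram_per_gpu_gb = 32
weights_gb = 80

total_vram = gpus * vram_per_gpu_gb   # 128 GB
usable = total_vram * 0.95            # --gpu-memory-utilization 0.95
kv_budget = usable - weights_gb
print(round(kv_budget, 1))            # ~41.6 GB for KV cache + overhead
```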

vLLM bug workaround: vLLM versions up to at least 0.15.2 have a bug in Qwen3_5MoeTextConfig where ignore_keys_at_rope_validation is defined as a list instead of a set, causing a TypeError during config parsing. Apply this fix before serving:

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

Vision Example (via OpenAI API)

import base64, requests

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.5-122B-A10B-GPTQ-4bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's Qwen3_5MoeGPTQ class expects the text-only weight prefix (model.layers.*) and does not support the multimodal architecture (model.language_model.layers.*). The transformers GPTQ path delegates to optimum, which does not handle the fused-expert architecture. Use vLLM for inference.

Technical Notes

Qwen3.5-122B-A10B stores MoE expert weights as fused 3D nn.Parameter tensors rather than individual nn.Linear modules. During quantization, GPTQModel's MODULE_CONVERTER_MAP converts these to individual quantizable nn.Linear layers. This same conversion must also run during model loading for the quantized kernels to be applied correctly.
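A toy illustration of that conversion, using NumPy arrays in place of real checkpoint tensors (the dimensions here are made up; the real model has 256 experts per layer):

```python
import numpy as np

num_experts, out_features, in_features = 4, 6, 8  # toy sizes, not the real model

# Fused storage: one 3D parameter holding every expert's weight matrix.
fused = np.random.randn(num_experts, out_features, in_features)

# Converter behavior, conceptually: slice out one 2D weight per expert
# so each can be quantized (and later dequantized) as an ordinary linear layer.
per_expert = [fused[i] for i in range(num_experts)]

assert len(per_expert) == num_experts
assert per_expert[0].shape == (out_features, in_features)
```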

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text model's MoE expert weights are quantized. vLLM casts BF16 vision tensors to FP16 at load time when using --dtype float16.

Credits

  • Base Model: Qwen - Qwen3.5-122B-A10B
  • Quantization: GPTQ via GPTQModel v5.7.1
  • Expert Converter: convert_qwen3_5_moe_expert_converter for fused 3D expert weights
  • Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
