# Qwen3.5-122B-A10B GPTQ 4-bit
GPTQ 4-bit quantization of Qwen/Qwen3.5-122B-A10B, a 122B-parameter Mixture-of-Experts (MoE) multimodal model with ~10B activated parameters per token.
Includes full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.
## Model Overview
- Architecture: Qwen3_5MoeForConditionalGeneration (multimodal: text + vision)
- Total parameters: ~122B
- Activated parameters: ~10B per token (8 of 256 experts selected per token)
- Layers: 48 (36 linear attention + 12 full attention, repeating 3:1 pattern)
- Experts: 256 per layer + 1 shared expert per layer
- Context length: 262,144 tokens
- Vision encoder: 27-block ViT (1152 hidden, 16x16 patches), BF16
- MTP module: 1-layer speculative decoding head, BF16
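The 3:1 interleaving of linear and full attention can be sketched in a few lines. Placing full attention on every 4th layer is an assumption for illustration, consistent with the stated 36/12 split:

```python
# Sketch of the 48-layer attention schedule described above.
# ASSUMPTION: full attention occupies every 4th layer; the model card only
# states the repeating 3:1 pattern and the 36/12 split.
NUM_LAYERS = 48

layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

print(layer_types.count("linear_attention"))  # 36
print(layer_types.count("full_attention"))    # 12
```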
## Quantization Details
All 36,864 MoE expert modules (256 experts x 3 projections x 48 layers) are quantized to INT4 using GPTQ. Non-expert modules (including the full vision encoder and MTP module) remain at BF16/FP16 for quality preservation.
| Component | Precision | Notes |
|---|---|---|
| MoE experts (`gate_proj`, `up_proj`, `down_proj`) | INT4 (GPTQ) | 36,864 modules quantized |
| Full attention (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | FP16 | Every 4th layer (12 layers) |
| Linear attention (`in_proj_qkv`, `in_proj_z`, `out_proj`) | FP16 | Full precision (36 layers) |
| Shared experts | FP16 | Full precision |
| Vision encoder (`model.visual.*`) | BF16 | Full precision |
| MTP module (`mtp.*`) | BF16 | Full precision |
| Embeddings, LM head, norms | FP16 | Full precision |
GPTQ configuration:
- Bits: 4
- Group size: 32
- Symmetric: Yes
- desc_act: No
- true_sequential: Yes
- act_group_aware: Yes
- Failsafe: RTN for poorly-calibrated rare experts (1,452 of 36,864 modules, ~3.9%)
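With symmetric quantization and group size 32, each group of 32 INT4 weights carries one FP16 scale and no zero-point, so the effective storage cost for an expert weight works out to 4.5 bits. This back-of-the-envelope sketch ignores packing and index metadata, so real checkpoints are slightly larger:

```python
# Back-of-the-envelope storage cost per quantized expert weight.
# ASSUMES one FP16 scale per group and no zero-point (symmetric quantization);
# ignores packing/index metadata.
bits = 4
group_size = 32
scale_bits = 16  # FP16 scale shared by each group of 32 weights

effective_bits = bits + scale_bits / group_size
print(effective_bits)        # 4.5 bits per weight
print(16 / effective_bits)   # ~3.56x reduction vs BF16 for expert weights alone
```

The overall 2.9x compression reported below is lower than this per-expert 3.56x because the attention layers, embeddings, vision encoder, and MTP module stay at FP16/BF16.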
## Calibration
- Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
- Samples: 2,048
- Quantizer: GPTQModel v5.7.1
- Quantization time: ~21 hours on 4x AMD MI100 (32GB)
## Model Size
| Version | Size | Compression |
|---|---|---|
| BF16 (original) | 234 GB | - |
| GPTQ 4-bit | 80 GB | 2.9x |
## Perplexity
Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, non-overlapping chunks:
| Model | Perplexity | Method |
|---|---|---|
| BF16 (original) | 4.8366 | llama-perplexity (GGUF BF16, ctx=2048) |
| GPTQ 4-bit | 5.1206 | vLLM API logprobs (stride=2048) |
Note on comparability: These measurements use different inference backends (llama.cpp vs vLLM) because the full BF16 model (234 GB) does not fit in vLLM on this system. Differences in numerical precision, tokenization, and logprob collection method mean the values are not directly comparable.
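For reference, perplexity from API logprobs reduces to the exponential of the negative mean token log-probability over the scored tokens. A minimal sketch (the helper below is illustrative, not the exact evaluation script used above):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p(token)) over all scored tokens.

    token_logprobs: natural-log probabilities, e.g. collected from an
    OpenAI-compatible completions API over non-overlapping chunks.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example: four tokens, each with probability ~0.2 (log p ~= -1.609)
print(perplexity([-1.609, -1.609, -1.609, -1.609]))  # ~= 5.0
```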
## Usage

### vLLM (Recommended for Serving)
```bash
vllm serve btbtyler09/Qwen3.5-122B-A10B-GPTQ-4bit \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'
```
| Parameter | Description |
|---|---|
| `--gpu-memory-utilization 0.95` | Use 95% of GPU VRAM for KV cache + weights |
| `--max-model-len 32768` | 32K context (increase if GPU memory allows) |
| `--tensor-parallel-size 4` | Shard across 4 GPUs (adjust to your setup) |
| `--reasoning-parser qwen3` | Enable thinking/reasoning token parsing |
| `--enable-auto-tool-choice --tool-call-parser qwen3_coder` | Enable tool/function calling |
| `--dtype float16` | Run in FP16 (required for ROCm GPTQ kernels) |
| `--skip-mm-profiling` | Skip multimodal memory profiling at startup |
| `--limit-mm-per-prompt '{"image": 2}'` | Allow up to 2 images per request |
Context length note: The quantized weights occupy ~80 GB, leaving limited KV-cache headroom on 4x 32 GB GPUs. Start with `--max-model-len 32768` and increase if memory permits. For the full 256K context, use 8+ GPUs or GPUs with more VRAM.
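A rough KV-cache budget helps size `--max-model-len`: only the 12 full-attention layers accumulate a growing KV cache, while the linear-attention layers keep constant-size state. The head count and head dimension below are assumed values for illustration, not taken from the model config:

```python
# Rough per-sequence KV-cache estimate for the full-attention layers only.
# ASSUMPTIONS: num_kv_heads and head_dim are illustrative guesses; this model
# card does not state them. Linear-attention layers are excluded (constant state).
full_attn_layers = 12
num_kv_heads = 8       # assumption
head_dim = 128         # assumption
bytes_per_elem = 2     # FP16 cache
tokens = 32768

per_token = 2 * full_attn_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * tokens / 1024**3
print(f"{per_token} bytes/token -> {total_gib:.2f} GiB per sequence at 32K context")
```

Under these assumptions a single 32K-token sequence costs about 1.5 GiB of cache, which is why the remaining VRAM after ~80 GB of weights constrains context length and batch size.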
vLLM bug workaround: vLLM versions up to at least 0.15.2 have a bug in `Qwen3_5MoeTextConfig` where `ignore_keys_at_rope_validation` is defined as a `list` instead of a `set`, causing a `TypeError` during config parsing. Apply this fix before serving (adjust the site-packages path to your install):

```python
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation"] = [\n "mrope_section",\n "mrope_interleaved",\n ]',
        'ignore_keys_at_rope_validation"] = {\n "mrope_section",\n "mrope_interleaved",\n }')
    open(f, 'w').write(t)
    print('Patched', f)
```
### Vision Example (via OpenAI API)
```python
import base64
import requests

# Encode a local image as a base64 data URL for the OpenAI-compatible API
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.5-122B-A10B-GPTQ-4bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])
```
### GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's `Qwen3_5MoeGPTQ` class expects the text-only weight prefix (`model.layers.*`) and does not support the multimodal architecture (`model.language_model.layers.*`). The transformers GPTQ path delegates to `optimum`, which does not handle the fused-expert architecture. Use vLLM for inference.
## Technical Notes
Qwen3.5-122B-A10B stores MoE expert weights as fused 3D `nn.Parameter` tensors rather than individual `nn.Linear` modules. During quantization, GPTQModel's `MODULE_CONVERTER_MAP` converts these into individual quantizable `nn.Linear` layers. The same conversion must also run during model loading so the quantized kernels are applied correctly.

The vision encoder (27-block ViT) and the MTP speculative-decoding module are preserved at the original model's full BF16 precision. Only the text model's MoE expert weights are quantized. When serving with `--dtype float16`, vLLM casts the BF16 vision tensors to FP16 at load time.
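The fused-to-Linear conversion can be illustrated with a minimal sketch. Shapes and the function name are illustrative only, not GPTQModel's actual converter:

```python
import torch
import torch.nn as nn

def unfuse_experts(fused: torch.Tensor) -> nn.ModuleList:
    """Split a fused [num_experts, out_features, in_features] parameter into
    one quantizable nn.Linear per expert (illustrative sketch; the real
    converter in GPTQModel may differ in layout and naming)."""
    num_experts, out_f, in_f = fused.shape
    experts = nn.ModuleList()
    for e in range(num_experts):
        lin = nn.Linear(in_f, out_f, bias=False)
        lin.weight = nn.Parameter(fused[e].clone())
        experts.append(lin)
    return experts

# Tiny example: 4 experts, each a 16 -> 8 projection
fused = torch.randn(4, 8, 16)
experts = unfuse_experts(fused)
x = torch.randn(2, 16)
assert torch.allclose(experts[0](x), x @ fused[0].T)
```

Once unfused, each expert is an ordinary `nn.Linear` that per-layer GPTQ calibration can quantize independently, which is what makes the per-expert RTN failsafe above possible.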
## Credits
- Base Model: Qwen - Qwen3.5-122B-A10B
- Quantization: GPTQ via GPTQModel v5.7.1
- Expert Converter: `convert_qwen3_5_moe_expert_converter` for fused 3D expert weights
- Quantized by: btbtyler09
## License
This model inherits the Apache 2.0 license from the base model.