Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16
Quality-focused AWQ quantization of Qwen3-30B-A3B-Instruct-2507 with GQA-aware attention preservation: 61 GB → ~18 GB.
A carefully optimized AWQ (W4A16) quantization of Qwen's updated Mixture-of-Experts instruct model (July 2025). Unlike standard AWQ quantizations of this model, this version preserves full-precision attention layers to avoid the quality degradation caused by unprotected quantization of GQA attention projections.
What Makes This Different
Most AWQ quantizations of GQA models (including those from official sources) emit 48 easily overlooked warnings per run, one per transformer layer:
skipping AWQ for model.layers.X.self_attn.v_proj ... incompatible balance layers
This happens because AWQ's smoothing algorithm cannot coordinate scaling between v_proj (4 KV heads, 512 dim) and o_proj (32 Q heads, 4096 dim) in Grouped Query Attention architectures. The result: o_proj gets quantized to W4 without any activation-aware protection, which is the worst-case scenario for quantization quality.
Our approach
We identified that when AWQ cannot perform its core function (activation-weighted channel protection) on the attention layers, it's better to keep them at full precision rather than apply blind quantization:
| Component | Standard AWQ | This model |
|---|---|---|
| q/k/v/o_proj (attention) | W4 without smoothing ⚠️ | BF16 full precision ✅ |
| gate/up/down_proj (MLP experts) | W4 with AWQ smoothing ✅ | W4 with AWQ smoothing ✅ |
| MoE routing gates | BF16 ✅ | BF16 ✅ |
| lm_head | BF16 ✅ | BF16 ✅ |
The MLP expert layers (gate_proj, up_proj, down_proj), which represent 96%+ of the model's parameters, receive full AWQ treatment with both smoothing mappings intact:
post_attention_layernorm → [gate_proj, up_proj]
up_proj → [down_proj]
The attention layers (~1.3 GB) stay in BF16, adding minimal overhead for significantly better output quality.
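The smoothing mappings for the MLP experts rest on a simple identity: dividing one layer's output channels by a per-channel scale and multiplying the next layer's matching input channels by the same scale leaves the final result unchanged. The pure-Python sketch below illustrates that identity for an up_proj → down_proj style pair. The tiny matrices and the scale values are made up for illustration and are not taken from this model, and the gating nonlinearity is left out for brevity.

```python
def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

# Toy stand-ins: "up" maps 2 -> 3 channels, "down" maps 3 -> 2 channels.
up = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.25]]
down = [[1.0, 0.0, 2.0], [-1.0, 4.0, 0.5]]
x = [0.5, -2.0]

# Hypothetical per-channel scales for the 3 intermediate channels.
scales = [2.0, 0.5, 4.0]

# Fold 1/s into the rows (output channels) of "up" ...
up_scaled = [[w / s for w in row] for row, s in zip(up, scales)]
# ... and s into the matching columns (input channels) of "down".
down_scaled = [[w * s for w, s in zip(row, scales)] for row in down]

reference = matvec(down, matvec(up, x))
smoothed = matvec(down_scaled, matvec(up_scaled, x))
max_diff = max(abs(a - b) for a, b in zip(reference, smoothed))
print(max_diff)
```

The point of the sketch is that the scale vector needs exactly one entry per shared channel, which is straightforward for the expert MLP pair because both sides agree on the intermediate width.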
Model Overview
| Property | Value |
|---|---|
| Architecture | Qwen3-MoE: standard transformer with Mixture-of-Experts |
| Parameters | 30B total, 3B active per token (128 experts, top-8 routing) |
| Quantization | AWQ W4A16 (4-bit integer weights, 16-bit activations) |
| Size | ~18 GB (down from 61 GB) |
| Format | compressed-tensors (native vLLM support) |
| Context | 262,144 tokens natively |
| Attention | BF16 full precision (GQA-aware preservation) |
| MLP Experts | W4 with full AWQ smoothing (18,432 modules) |
What's New in 2507
This is the quantized version of the July 2025 update, which brings significant improvements over the original Qwen3-30B-A3B:
- Improved instruction following, logical reasoning, math, science, coding and tool usage
- Better long-tail knowledge coverage across multiple languages
- Enhanced alignment with user preferences in open-ended tasks
- Improved 256K long-context understanding
Note: This model supports only non-thinking mode and does not generate <think></think> blocks.
Quick Start
vLLM (recommended)
vllm serve Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16
vLLM with Docker
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16
Python (OpenAI-compatible API)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
Python (Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16"
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the key improvements in Qwen3's July 2025 update?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Quantization Details
Method
AWQ (Activation-Aware Weight Quantization) using llmcompressor with GQA-aware custom mappings and selective layer preservation.
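To make the "W4" part concrete, here is a minimal sketch of round-to-nearest 4-bit quantization applied to one small group of weights. It is an illustration only: the asymmetric min/max scheme, the group of eight values, and the helper names `quantize_group` and `dequantize_group` are assumptions for the example, not a description of the exact scheme or group size used for this checkpoint.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    levels = 2 ** bits - 1               # 15 steps for 4-bit codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0    # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to floating-point weights."""
    return [c * scale + lo for c in codes]

# Made-up weights standing in for one quantization group.
group = [0.12, -0.40, 0.33, 0.05, -0.07, 0.29, -0.21, 0.18]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(scale, 4), round(max_err, 4))
```

The rounding error per weight is bounded by half the group's step size, which is why AWQ tries to rescale the channels that matter most for the activations before this rounding step happens.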
Calibration
| Setting | Value |
|---|---|
| Samples | 256 |
| Sequence length | 1024 tokens (p90-optimized) |
| Calibration language | Italian |
| MoE coverage | All 128 experts calibrated (moe_calibrate_all_experts=True) |
| Pipeline | Sequential (layer-by-layer) |
| Hardware | 1× NVIDIA B200 SXM (192 GB VRAM) |
Preserved Layers (not quantized)
The following layers are kept in their original BF16 precision:
| Pattern | Count | Reason |
|---|---|---|
| lm_head | 1 | Output projection: critical for token prediction |
| mlp.gate | 48 | MoE routing gates: high impact on expert selection |
| self_attn.q_proj | 48 | Query projection: GQA incompatible with AWQ smoothing |
| self_attn.k_proj | 48 | Key projection: GQA incompatible with AWQ smoothing |
| self_attn.v_proj | 48 | Value projection: GQA incompatible with AWQ smoothing |
| self_attn.o_proj | 48 | Output projection: no AWQ smoothing possible without v_proj coordination |
A total of 241 modules are preserved in original precision. All remaining 18,432 MLP expert modules are quantized with full AWQ protection.
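The module counts can be checked with a short sketch that enumerates linear-module names and matches them against the preserved patterns. The naming scheme (`model.layers.N.mlp.experts.E.gate_proj` and so on) is an assumption modeled on the warning message quoted earlier, and the regular expressions are illustrative rather than the exact ignore list passed to the quantization tool.

```python
import re

NUM_LAYERS, NUM_EXPERTS = 48, 128

# Patterns for modules kept in BF16, mirroring the preserved-layers table.
PRESERVED = [
    r"^lm_head$",
    r"\.mlp\.gate$",
    r"\.self_attn\.[qkvo]_proj$",
]

def linear_module_names():
    """Yield hypothetical names for the linear modules in the model."""
    yield "lm_head"
    for layer in range(NUM_LAYERS):
        prefix = f"model.layers.{layer}"
        for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
            yield f"{prefix}.self_attn.{proj}"
        yield f"{prefix}.mlp.gate"
        for expert in range(NUM_EXPERTS):
            for proj in ("gate_proj", "up_proj", "down_proj"):
                yield f"{prefix}.mlp.experts.{expert}.{proj}"

def is_preserved(name):
    """True if the module name matches any preserved pattern."""
    return any(re.search(pattern, name) for pattern in PRESERVED)

names = list(linear_module_names())
preserved = [n for n in names if is_preserved(n)]
quantized = [n for n in names if not is_preserved(n)]
print(len(preserved), len(quantized))  # 241 18432
```

Note that the `\.mlp\.gate$` pattern is anchored at the end of the name so that it matches the router but not the experts' gate_proj modules.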
Why Preserve Attention?
Qwen3-30B-A3B uses Grouped Query Attention (32 Q heads, 4 KV heads). AWQ smoothing works by coordinating scaling between consecutive layers:
v_proj (512 dim) ──scale──→ o_proj (4096 dim)
With GQA, the dimension mismatch (8:1 ratio) breaks this coordination, so AWQ cannot compute proper per-channel protection for o_proj. Rather than quantize it without protection, as other quantizations do, we keep the entire attention block in BF16 for consistent precision.
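The mismatch can be reproduced with a few lines of arithmetic. In this sketch the head dimension of 128 is an assumption inferred from the figures above (512 / 4 and 4096 / 32), and `can_share_scales` is a hypothetical helper that only captures the one-to-one channel pairing a shared scale vector would need.

```python
NUM_Q_HEADS = 32
NUM_KV_HEADS = 4
HEAD_DIM = 128  # assumption: inferred from 512 / 4 and 4096 / 32

v_proj_out = NUM_KV_HEADS * HEAD_DIM    # channels produced by v_proj
o_proj_in = NUM_Q_HEADS * HEAD_DIM      # channels consumed by o_proj
gqa_ratio = NUM_Q_HEADS // NUM_KV_HEADS  # query heads per KV head

def can_share_scales(out_channels, in_channels):
    """A single per-channel scale vector needs a one-to-one channel pairing."""
    return out_channels == in_channels

print(v_proj_out, o_proj_in, gqa_ratio)
print(can_share_scales(v_proj_out, o_proj_in))  # GQA: widths differ
print(can_share_scales(o_proj_in, o_proj_in))   # equal widths would pair up
```

With equal head counts the two widths would coincide and the scale vector could be folded into both layers; with 4 KV heads feeding 32 query heads, each v_proj channel is shared by 8 o_proj input channels and the simple pairing no longer exists.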
Hardware Requirements
| Setup | VRAM | Notes |
|---|---|---|
| 1× RTX 3090 (24 GB) | ~18 GB | ✅ Fits; no FP8/FP4 hardware required |
| 1× RTX 4090 (24 GB) | ~18 GB | ✅ Fits with room for KV cache |
| 1× RTX 5090 (32 GB) | ~18 GB | ✅ Comfortable fit |
| 1× A100 (40/80 GB) | ~18 GB | ✅ Plenty of headroom |
| 1× H100 (80 GB) | ~18 GB | ✅ Ideal for long contexts |
Unlike FP8/NVFP4, AWQ W4A16 needs no native FP8/FP4 hardware support, so it also runs on older NVIDIA GPUs such as the RTX 3090 and 3080, provided the ~18 GB of weights plus the KV cache fit in the available VRAM.
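The VRAM column covers the weights only; the KV cache comes on top and grows with context length. The sketch below gives a rough estimate under stated assumptions: a head dimension of 128 (inferred from the 512-dim v_proj output and 4 KV heads), a 16-bit KV cache, and no allowance for activations or framework overhead.

```python
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128        # assumption: 512-dim v_proj output / 4 KV heads
BYTES_PER_VALUE = 2   # assumption: 16-bit KV cache

def kv_cache_bytes(num_tokens):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens

GIB = 1024 ** 3
for tokens in (8_192, 32_768, 262_144):
    print(tokens, round(kv_cache_bytes(tokens) / GIB, 2), "GiB")
```

Under these assumptions a single full 262,144-token sequence would need about 24 GiB of cache on its own, so 24 GB cards have to run with a reduced context length, for example via vLLM's `--max-model-len` option.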
Architecture Notes
Qwen3-30B-A3B is a standard transformer with Mixture-of-Experts (MoE) feed-forward layers:
- 48 transformer layers, each with a MoE FFN block
- 128 experts per layer, with top-8 routing per token
- ~3B parameters active per token out of 30B total
- Grouped Query Attention: 32 Q heads, 4 KV heads (8:1 GQA ratio)
- 262,144 native context length
The MoE architecture means 96%+ of parameters are in the expert MLP layers, exactly where AWQ smoothing is applied in this quantization.
Important Notes
- Calibration language: calibrated on Italian data. The model retains its full multilingual capabilities (100+ languages), but quantization quality may be slightly optimized for Italian and similar Romance languages.
- Sequence length: calibrated at 1024 tokens. The model supports up to 262K context, but the quantization statistics were collected at this shorter length.
- vLLM recommended: the compressed-tensors format is natively supported by vLLM. Other inference engines may require conversion.
- Non-thinking mode only: this model does not generate <think></think> blocks. For reasoning mode, use the Thinking variant.
- Broad GPU support: unlike FP8/NVFP4, AWQ W4A16 needs no native FP8/FP4 hardware support and runs on consumer cards (RTX 3090, 4090, etc.).
- Benchmarks: coming soon. Community evaluations welcome.
License
This model inherits the Apache 2.0 license from the base model.
Quantized with ❤️ by Sophia AI
AWQ W4A16 • GQA-aware attention preservation • 128 experts fully calibrated • Ready for vLLM