⚑ Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16

Quality-focused AWQ quantization of Qwen3-30B-A3B-Instruct-2507 with GQA-aware attention preservation: 61 GB → ~18 GB.

A carefully optimized AWQ (W4A16) quantization of Qwen's updated Mixture-of-Experts instruct model (July 2025). Unlike standard AWQ quantizations of this model, this version preserves full-precision attention layers to avoid the quality degradation caused by unprotected quantization of GQA attention projections.


🎯 What Makes This Different

Most AWQ quantizations of GQA models (including those from official sources) quietly skip attention smoothing, emitting 48 warnings per run:

skipping AWQ for model.layers.X.self_attn.v_proj ... incompatible balance layers

This happens because AWQ's smoothing algorithm cannot coordinate scaling between v_proj (4 KV heads, 512 dim) and o_proj (32 Q heads, 4096 dim) in Grouped Query Attention architectures. The result: o_proj gets quantized to W4 without any activation-aware protection, the worst-case scenario for quantization quality.

Our approach

We identified that when AWQ cannot perform its core function (activation-weighted channel protection) on the attention layers, it's better to keep them at full precision rather than apply blind quantization:

| Component | Standard AWQ | This model |
| --- | --- | --- |
| q/k/v/o_proj (attention) | W4 without smoothing ⚠️ | BF16 full precision ✅ |
| gate/up/down_proj (MLP experts) | W4 with AWQ smoothing ✅ | W4 with AWQ smoothing ✅ |
| MoE routing gates | BF16 ✅ | BF16 ✅ |
| lm_head | BF16 ✅ | BF16 ✅ |

The MLP expert layers (gate_proj, up_proj, down_proj), which account for 96%+ of the model's parameters, receive full AWQ treatment with both smoothing mappings intact:

  1. post_attention_layernorm β†’ [gate_proj, up_proj]
  2. up_proj β†’ [down_proj]
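Mapping 1 is an exact reweighting: the per-channel AWQ scales can be divided out of the RMSNorm weight and folded into the linear layer's input channels without changing the output. A minimal NumPy sketch with toy dimensions (not the real model shapes) illustrates the idea:

```python
import numpy as np

# Toy check of AWQ-style scale folding for mapping 1
# (post_attention_layernorm -> gate_proj/up_proj). Dimensions are illustrative.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)            # hidden state entering the norm
g = rng.normal(size=d)            # RMSNorm weight
W = rng.normal(size=(d, d))       # gate_proj (or up_proj) weight

def rmsnorm(v, weight, eps=1e-6):
    return weight * v / np.sqrt(np.mean(v**2) + eps)

s = rng.uniform(0.5, 2.0, size=d)  # per-channel AWQ protection scales

y_ref = W @ rmsnorm(x, g)
# Fold 1/s into the norm weight and s into W's input channels.
y_folded = (W * s) @ rmsnorm(x, g / s)

assert np.allclose(y_ref, y_folded)  # output is unchanged
```

Because the scaled weight `W * s` is what gets quantized, channels given large scales keep more of their dynamic range. This is exactly the coordination that has no valid counterpart for o_proj under GQA.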

The attention layers (~1.3 GB) stay in BF16, adding minimal overhead for significantly better output quality.


πŸ—οΈ Model Overview

| Property | Value |
| --- | --- |
| 🧬 Architecture | Qwen3-MoE: standard transformer with Mixture-of-Experts |
| 📏 Parameters | 30B total, 3B active per token (128 experts, top-8 routing) |
| 🗜️ Quantization | AWQ W4A16 (4-bit integer weights, 16-bit activations) |
| 📦 Size | 18 GB (from 61 GB BF16), **70% reduction** |
| 🔧 Format | compressed-tensors, native vLLM support |
| 📏 Context | 262,144 tokens natively |
| 🧠 Attention | BF16 full precision (GQA-aware preservation) |
| ⚙️ MLP Experts | W4 with full AWQ smoothing (18,432 modules) |

🆕 What's New in 2507

This is the quantized version of the July 2025 update, which brings significant improvements over the original Qwen3-30B-A3B:

  • Improved instruction following, logical reasoning, math, science, coding and tool usage
  • Better long-tail knowledge coverage across multiple languages
  • Enhanced alignment with user preferences in open-ended tasks
  • Improved 256K long-context understanding

Note: This model supports only non-thinking mode and does not generate <think></think> blocks.


🚀 Quick Start

vLLM (recommended)

vllm serve Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16

vLLM with Docker

docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16

Python (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Python (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the key improvements in Qwen3's July 2025 update?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🔬 Quantization Details

Method

AWQ (Activation-Aware Weight Quantization) using llmcompressor with GQA-aware custom mappings and selective layer preservation.

Calibration

| Setting | Value |
| --- | --- |
| 📊 Samples | 256 |
| 📏 Sequence length | 1024 tokens (p90-optimized) |
| 🌍 Calibration language | Italian |
| 🔀 MoE coverage | All 128 experts calibrated (`moe_calibrate_all_experts=True`) |
| ⚙️ Pipeline | Sequential (layer-by-layer) |
| 🖥️ Hardware | 1× NVIDIA B200 SXM (192 GB VRAM) |

Preserved Layers (not quantized)

The following layers are kept in their original BF16 precision:

| Pattern | Count | Reason |
| --- | --- | --- |
| lm_head | 1 | Output projection, critical for token prediction |
| mlp.gate | 48 | MoE routing gates, high impact on expert selection |
| self_attn.q_proj | 48 | Query projection, GQA incompatible with AWQ smoothing |
| self_attn.k_proj | 48 | Key projection, GQA incompatible with AWQ smoothing |
| self_attn.v_proj | 48 | Value projection, GQA incompatible with AWQ smoothing |
| self_attn.o_proj | 48 | Output projection, no AWQ smoothing possible without v_proj coordination |

A total of 241 modules are preserved in original precision. All remaining 18,432 MLP expert modules are quantized with full AWQ protection.
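Both totals follow directly from the architecture figures quoted elsewhere in this card (48 layers, 128 experts per layer), as a quick arithmetic check shows:

```python
# Module counts implied by the preserved-layers table: 48 layers, 128 experts.
num_layers, num_experts = 48, 128

preserved = (
    4 * num_layers      # q/k/v/o_proj per layer
    + num_layers        # mlp.gate per layer
    + 1                 # lm_head
)
quantized = num_layers * num_experts * 3  # gate/up/down_proj per expert

print(preserved, quantized)  # 241 18432
```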

Why Preserve Attention?

Qwen3-30B-A3B uses Grouped Query Attention (32 Q heads, 4 KV heads). AWQ smoothing works by coordinating scaling between consecutive layers:

v_proj (512 dim) ──scale──→ o_proj (4096 dim)

With GQA, the dimension mismatch (8:1 ratio) breaks this coordination: AWQ cannot compute proper per-channel protection for o_proj. Rather than quantize it without protection, as other quantizations do, we keep the entire attention block in BF16 for consistent precision.
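In concrete numbers (head_dim = 128 is an assumption, consistent with 32 × 128 = 4096):

```python
# Why the v_proj -> o_proj scale mapping breaks under GQA.
num_q_heads, num_kv_heads, head_dim = 32, 4, 128  # head_dim assumed

v_proj_out = num_kv_heads * head_dim  # 512 output channels AWQ would scale
o_proj_in = num_q_heads * head_dim    # 4096 input channels needing protection

# AWQ folds scales through a one-to-one channel correspondence. Under GQA,
# each v_proj channel feeds 8 o_proj input channels after KV-head replication,
# so no consistent per-channel scale pair exists.
print(o_proj_in // v_proj_out)  # 8
```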


💻 Hardware Requirements

| Setup | VRAM | Notes |
| --- | --- | --- |
| 1× RTX 3090 (24 GB) | ~18 GB | ✅ Fits; AWQ W4 runs on all GPUs |
| 1× RTX 4090 (24 GB) | ~18 GB | ✅ Fits with room for KV cache |
| 1× RTX 5090 (32 GB) | ~18 GB | ✅ Comfortable fit |
| 1× A100 (40/80 GB) | ~18 GB | ✅ Plenty of headroom |
| 1× H100 (80 GB) | ~18 GB | ✅ Ideal for long contexts |

AWQ W4A16 runs on all NVIDIA GPUs (no compute capability requirement unlike FP8/NVFP4). This includes RTX 3090, 3080, and older cards.
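A rough way to see how much context fits next to the ~18 GB of weights on a 24 GB card. The layer and KV-head counts come from this card; head_dim = 128 and a BF16 KV cache are assumptions:

```python
# Back-of-envelope KV-cache budget on a 24 GB card next to ~18 GB of weights.
# head_dim = 128 and a BF16 KV cache are assumptions; layer and KV-head
# counts come from this card.
layers, kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
free_bytes = (24 - 18) * 1024**3 * 0.8  # keep ~20% slack for activations
max_kv_tokens = int(free_bytes // kv_bytes_per_token)

print(f"{kv_bytes_per_token} B/token, ~{max_kv_tokens:,} tokens of KV cache")
```

Under these assumptions the small 4-head KV cache is a practical benefit of GQA: tens of thousands of cached tokens fit in the remaining VRAM.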


πŸ›οΈ Architecture Notes

Qwen3-30B-A3B is a standard transformer with Mixture-of-Experts (MoE) feed-forward layers:

  • 48 transformer layers, each with a MoE FFN block
  • 128 experts per layer, with top-8 routing per token
  • ~3B parameters active per token out of 30B total
  • Grouped Query Attention: 32 Q heads, 4 KV heads (8:1 GQA ratio)
  • 262,144 native context length

The MoE architecture means 96%+ of parameters are in the expert MLP layers β€” exactly where AWQ smoothing is applied in this quantization.
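The 96% figure can be sanity-checked from assumed config values (hidden_size = 2048 and moe_intermediate_size = 768 are assumptions, not taken from this card; the layer and expert counts are):

```python
# Estimate the share of parameters living in the expert MLPs.
hidden = 2048            # assumed hidden_size
moe_intermediate = 768   # assumed per-expert intermediate size
layers, experts = 48, 128

# Each expert has three projections: gate_proj, up_proj, down_proj.
expert_params = layers * experts * 3 * hidden * moe_intermediate
total_params = 30e9      # headline parameter count from this card

print(f"{expert_params / total_params:.1%}")  # 96.6%
```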


⚠️ Important Notes

  • 🎯 Calibration language: calibrated on Italian data. The model retains its full multilingual capabilities (100+ languages), but quantization quality may be slightly optimized for Italian and similar Romance languages.
  • 📏 Sequence length: calibrated at 1024 tokens. The model supports up to 262K context, but quantization statistics are optimized for this range.
  • 🔧 vLLM recommended: the compressed-tensors format is natively supported by vLLM. Other inference engines may require conversion.
  • 🧠 Non-thinking mode only: this model does not generate <think></think> blocks. For reasoning mode, use the Thinking variant.
  • 🖥️ Broad GPU support: unlike FP8/NVFP4, AWQ W4A16 runs on any NVIDIA GPU, including consumer cards (RTX 3090, 4090, etc.).
  • 📊 Benchmarks: coming soon. Community evaluations welcome.

📜 License

This model inherits the Apache 2.0 license from the base model.


Quantized with ❤️ by Sophia AI
AWQ W4A16 • GQA-aware attention preservation • 128 experts fully calibrated • Ready for vLLM
