⚑ Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16

Quality-focused AWQ quantization of Qwen3-30B-A3B-Instruct-2507 with GQA-aware attention preservation: 61 GB → ~18 GB.

A carefully optimized AWQ (W4A16) quantization of Qwen's updated Mixture-of-Experts instruct model (July 2025). Unlike standard AWQ quantizations of this model, this version preserves full-precision attention layers to avoid the quality degradation caused by unprotected quantization of GQA attention projections.


🎯 What Makes This Different

Most AWQ quantizations of GQA models (including those from official sources) quietly skip attention smoothing, emitting 48 warnings per run:

skipping AWQ for model.layers.X.self_attn.v_proj ... incompatible balance layers

This happens because AWQ's smoothing algorithm cannot coordinate scaling between v_proj (4 KV heads, 512 dim) and o_proj (32 Q heads, 4096 dim) in Grouped Query Attention architectures. The result: o_proj gets quantized to W4 without any activation-aware protection, the worst-case scenario for quantization quality.

Our approach

We identified that when AWQ cannot perform its core function (activation-weighted channel protection) on the attention layers, it's better to keep them at full precision rather than apply blind quantization:

| Component | Standard AWQ | This model |
| --- | --- | --- |
| q/k/v/o_proj (attention) | W4 without smoothing ⚠️ | BF16 full precision ✅ |
| gate/up/down_proj (MLP experts) | W4 with AWQ smoothing ✅ | W4 with AWQ smoothing ✅ |
| MoE routing gates | BF16 ✅ | BF16 ✅ |
| lm_head | BF16 ✅ | BF16 ✅ |

The MLP expert layers (gate_proj, up_proj, down_proj), which account for 96%+ of the model's parameters, receive full AWQ treatment with both smoothing mappings intact:

  1. post_attention_layernorm β†’ [gate_proj, up_proj]
  2. up_proj β†’ [down_proj]
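Mapping 1 is an exact reweighting: the per-channel AWQ scales can be divided out of the RMSNorm weight and folded into the linear layer's input channels without changing the output. A minimal NumPy sketch with toy dimensions (not the real model shapes) illustrates the idea:

```python
import numpy as np

# Toy check of AWQ-style scale folding for mapping 1
# (post_attention_layernorm -> gate_proj/up_proj). Dimensions are illustrative.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)            # hidden state entering the norm
g = rng.normal(size=d)            # RMSNorm weight
W = rng.normal(size=(d, d))       # gate_proj (or up_proj) weight

def rmsnorm(v, weight, eps=1e-6):
    return weight * v / np.sqrt(np.mean(v**2) + eps)

s = rng.uniform(0.5, 2.0, size=d)  # per-channel AWQ protection scales

y_ref = W @ rmsnorm(x, g)
# Fold 1/s into the norm weight and s into W's input channels.
y_folded = (W * s) @ rmsnorm(x, g / s)

assert np.allclose(y_ref, y_folded)  # output is unchanged
```

Because the scaled weight `W * s` is what gets quantized, channels given large scales keep more of their dynamic range. This is exactly the coordination that has no valid counterpart for o_proj under GQA.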

The attention layers (~1.3 GB) stay in BF16, adding minimal overhead for significantly better output quality.


πŸ—οΈ Model Overview

| Property | Value |
| --- | --- |
| 🧬 Architecture | Qwen3-MoE: standard transformer with Mixture-of-Experts |
| 📏 Parameters | 30B total, 3B active per token (128 experts, top-8 routing) |
| 🗜️ Quantization | AWQ W4A16 (4-bit integer weights, 16-bit activations) |
| 📦 Size | 18 GB (from 61 GB BF16), **70% reduction** |
| 🔧 Format | compressed-tensors, native vLLM support |
| 📏 Context | 262,144 tokens natively |
| 🧠 Attention | BF16 full precision (GQA-aware preservation) |
| ⚙️ MLP Experts | W4 with full AWQ smoothing (18,432 modules) |

🆕 What's New in 2507

This is the quantized version of the July 2025 update, which brings significant improvements over the original Qwen3-30B-A3B:

  • Improved instruction following, logical reasoning, math, science, coding and tool usage
  • Better long-tail knowledge coverage across multiple languages
  • Enhanced alignment with user preferences in open-ended tasks
  • Improved 256K long-context understanding

Note: This model supports only non-thinking mode and does not generate <think></think> blocks.


🚀 Quick Start

vLLM (recommended)

vllm serve Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16

vLLM with Docker

docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16

Python (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Python (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-AWQ-W4A16"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the key improvements in Qwen3's July 2025 update?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🔬 Quantization Details

Method

AWQ (Activation-Aware Weight Quantization) using llmcompressor with GQA-aware custom mappings and selective layer preservation.

Calibration

| Setting | Value |
| --- | --- |
| 📊 Samples | 256 |
| 📏 Sequence length | 1024 tokens (p90-optimized) |
| 🌍 Calibration language | Italian |
| 🔀 MoE coverage | All 128 experts calibrated (`moe_calibrate_all_experts=True`) |
| ⚙️ Pipeline | Sequential (layer-by-layer) |
| 🖥️ Hardware | 1× NVIDIA B200 SXM (192 GB VRAM) |

Preserved Layers (not quantized)

The following layers are kept in their original BF16 precision:

| Pattern | Count | Reason |
| --- | --- | --- |
| lm_head | 1 | Output projection, critical for token prediction |
| mlp.gate | 48 | MoE routing gates, high impact on expert selection |
| self_attn.q_proj | 48 | Query projection, GQA incompatible with AWQ smoothing |
| self_attn.k_proj | 48 | Key projection, GQA incompatible with AWQ smoothing |
| self_attn.v_proj | 48 | Value projection, GQA incompatible with AWQ smoothing |
| self_attn.o_proj | 48 | Output projection, no AWQ smoothing possible without v_proj coordination |

A total of 241 modules are preserved in original precision. All remaining 18,432 MLP expert modules are quantized with full AWQ protection.
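Both totals follow directly from the architecture figures quoted elsewhere in this card (48 layers, 128 experts per layer), as a quick arithmetic check shows:

```python
# Module counts implied by the preserved-layers table: 48 layers, 128 experts.
num_layers, num_experts = 48, 128

preserved = (
    4 * num_layers      # q/k/v/o_proj per layer
    + num_layers        # mlp.gate per layer
    + 1                 # lm_head
)
quantized = num_layers * num_experts * 3  # gate/up/down_proj per expert

print(preserved, quantized)  # 241 18432
```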

Why Preserve Attention?

Qwen3-30B-A3B uses Grouped Query Attention (32 Q heads, 4 KV heads). AWQ smoothing works by coordinating scaling between consecutive layers:

v_proj (512 dim) ──scale──→ o_proj (4096 dim)

With GQA, the dimension mismatch (8:1 ratio) breaks this coordination: AWQ cannot compute proper per-channel protection for o_proj. Rather than quantize it without protection, as other quantizations do, we keep the entire attention block in BF16 for consistent precision.
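In concrete numbers (head_dim = 128 is an assumption, consistent with 32 × 128 = 4096):

```python
# Why the v_proj -> o_proj scale mapping breaks under GQA.
num_q_heads, num_kv_heads, head_dim = 32, 4, 128  # head_dim assumed

v_proj_out = num_kv_heads * head_dim  # 512 output channels AWQ would scale
o_proj_in = num_q_heads * head_dim    # 4096 input channels needing protection

# AWQ folds scales through a one-to-one channel correspondence. Under GQA,
# each v_proj channel feeds 8 o_proj input channels after KV-head replication,
# so no consistent per-channel scale pair exists.
print(o_proj_in // v_proj_out)  # 8
```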


💻 Hardware Requirements

| Setup | VRAM | Notes |
| --- | --- | --- |
| 1× RTX 3090 (24 GB) | ~18 GB | ✅ Fits; AWQ W4 runs on all GPUs |
| 1× RTX 4090 (24 GB) | ~18 GB | ✅ Fits with room for KV cache |
| 1× RTX 5090 (32 GB) | ~18 GB | ✅ Comfortable fit |
| 1× A100 (40/80 GB) | ~18 GB | ✅ Plenty of headroom |
| 1× H100 (80 GB) | ~18 GB | ✅ Ideal for long contexts |

AWQ W4A16 runs on all NVIDIA GPUs (no compute capability requirement unlike FP8/NVFP4). This includes RTX 3090, 3080, and older cards.
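A rough way to see how much context fits next to the ~18 GB of weights on a 24 GB card. The layer and KV-head counts come from this card; head_dim = 128 and a BF16 KV cache are assumptions:

```python
# Back-of-envelope KV-cache budget on a 24 GB card next to ~18 GB of weights.
# head_dim = 128 and a BF16 KV cache are assumptions; layer and KV-head
# counts come from this card.
layers, kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
free_bytes = (24 - 18) * 1024**3 * 0.8  # keep ~20% slack for activations
max_kv_tokens = int(free_bytes // kv_bytes_per_token)

print(f"{kv_bytes_per_token} B/token, ~{max_kv_tokens:,} tokens of KV cache")
```

Under these assumptions the small 4-head KV cache is a practical benefit of GQA: tens of thousands of cached tokens fit in the remaining VRAM.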


πŸ›οΈ Architecture Notes

Qwen3-30B-A3B is a standard transformer with Mixture-of-Experts (MoE) feed-forward layers:

  • 48 transformer layers, each with a MoE FFN block
  • 128 experts per layer, with top-8 routing per token
  • ~3B parameters active per token out of 30B total
  • Grouped Query Attention: 32 Q heads, 4 KV heads (8:1 GQA ratio)
  • 262,144 native context length

The MoE architecture means 96%+ of parameters are in the expert MLP layers β€” exactly where AWQ smoothing is applied in this quantization.
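The 96% figure can be sanity-checked from assumed config values (hidden_size = 2048 and moe_intermediate_size = 768 are assumptions, not taken from this card; the layer and expert counts are):

```python
# Estimate the share of parameters living in the expert MLPs.
hidden = 2048            # assumed hidden_size
moe_intermediate = 768   # assumed per-expert intermediate size
layers, experts = 48, 128

# Each expert has three projections: gate_proj, up_proj, down_proj.
expert_params = layers * experts * 3 * hidden * moe_intermediate
total_params = 30e9      # headline parameter count from this card

print(f"{expert_params / total_params:.1%}")  # 96.6%
```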


⚠️ Important Notes

  • 🎯 Calibration language: calibrated on Italian data. The model retains its full multilingual capabilities (100+ languages), but quantization quality may be slightly optimized for Italian and similar Romance languages.
  • 📏 Sequence length: calibrated at 1024 tokens. The model supports up to 262K context, but quantization statistics are optimized for this range.
  • 🔧 vLLM recommended: the compressed-tensors format is natively supported by vLLM. Other inference engines may require conversion.
  • 🧠 Non-thinking mode only: this model does not generate <think></think> blocks. For reasoning mode, use the Thinking variant.
  • 🖥️ Broad GPU support: unlike FP8/NVFP4, AWQ W4A16 runs on any NVIDIA GPU, including consumer cards (RTX 3090, 4090, etc.).
  • 📊 Benchmarks: coming soon. Community evaluations welcome.

📜 License

This model inherits the Apache 2.0 license from the base model.


Quantized with ❤️ by Sophia AI
AWQ W4A16 • GQA-aware attention preservation • 128 experts fully calibrated • Ready for vLLM
