Qwen3-REAP-15B-A3B-W4A16

INT4 weight-quantized version of Qwen3-REAP-15B-A3B, created by applying REAP pruning + AutoRound W4A16 quantization to Qwen3-30B-A3B.

30B → 15B (REAP 50% prune) → 8.7 GB (W4A16) | ~3B active per token


Model Summary

|  | Original | REAP Pruned | Quantized (this model) |
|---|---|---|---|
| Model | Qwen/Qwen3-30B-A3B | atbender/Qwen3-REAP-15B-A3B | Qwen3-REAP-15B-A3B-W4A16 |
| Total parameters | ~30B | ~15B | ~15B |
| Active parameters | ~3B | ~3B | ~3B |
| Experts per layer | 128 | 64 | 64 |
| Experts routed per token | 8 | 8 | 8 |
| Hidden layers | 48 | 48 | 48 |
| Hidden size | 2048 | 2048 | 2048 |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, FP16 activations) |
| Disk size | ~57 GB | ~30 GB | ~8.7 GB |
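The 8.7 GB figure is close to what a back-of-envelope calculation predicts. A rough sketch (approximate, assuming one FP16 scale per 128-weight group; the remaining ~1 GB comes from FP16 embeddings, routers, and packing metadata):

```python
# Back-of-envelope W4A16 size check for a ~15B-parameter model.
total_params = 15e9
group_size = 128

int4_bytes = total_params * 4 / 8                # 4 bits per weight
scale_bytes = total_params / group_size * 2      # one FP16 scale per group

approx_gb = (int4_bytes + scale_bytes) / 1e9
print(round(approx_gb, 1))                       # ~7.7 GB for the INT4 weights alone
```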

Compression Pipeline

Stage 1: REAP Expert Pruning (30B → 15B)

REAP (Router-weighted Expert Activation Pruning) from Cerebras Research combines router gate statistics with expert activation norms to score each expert's contribution, then prunes the lowest-scoring experts. Here, 50% of MoE experts were pruned globally (128 → 64 per layer, across all 48 layers).
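A toy sketch of the scoring idea (hypothetical shapes and routing, not the Cerebras implementation): each expert's saliency is its average router-weighted output norm over the tokens routed to it, and the lowest-scoring half is dropped.

```python
import torch

def reap_scores(gate_weights, expert_outputs):
    """Toy REAP-style saliency: mean of (gate weight * expert output norm)
    over the tokens actually routed to each expert.

    gate_weights:   (tokens, experts) router weights, 0 where not routed
    expert_outputs: (tokens, experts, hidden) per-expert outputs
    """
    norms = expert_outputs.norm(dim=-1)            # (tokens, experts)
    contrib = gate_weights * norms
    counts = (gate_weights > 0).sum(dim=0).clamp(min=1)
    return contrib.sum(dim=0) / counts             # (experts,)

torch.manual_seed(0)
g = torch.rand(16, 8) * (torch.rand(16, 8) > 0.5)  # sparse toy routing
x = torch.randn(16, 8, 32)                         # toy expert outputs
scores = reap_scores(g, x)
keep = scores.argsort(descending=True)[:4]         # keep top 50% of experts
```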

REAP calibration data: 1,000 samples, packed into 2,048-token sequences:

| Source | Proportion | Dataset | Description |
|---|---|---|---|
| Agentic trajectories | 40% | togethercomputer/CoderForge-Preview | Passing SWE-agent trajectories |
| Raw code | 30% | bigcode/the-stack-smol (Python) | Python source code |
| General web text | 10% | allenai/c4 (English) | Pretraining distribution proxy |
| Broad coverage | 20% | NeelNanda/pile-10k | Mixed general text |
```yaml
prune_method: reap
compression_ratio: 0.5
seed: 42
distance_measure: cosine
samples_per_category: 1024
model_max_length: 2048
```
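The mixture above can be sketched as follows (dataset loading and tokenization are elided; `pack` is a hypothetical helper illustrating the fixed-length packing):

```python
# Calibration mix per the proportions in the table above.
proportions = {
    "togethercomputer/CoderForge-Preview": 0.40,
    "bigcode/the-stack-smol": 0.30,
    "NeelNanda/pile-10k": 0.20,
    "allenai/c4": 0.10,
}
n_samples = 1000
counts = {name: round(p * n_samples) for name, p in proportions.items()}

def pack(token_ids, seqlen=2048):
    """Concatenate token streams and cut into fixed-length sequences,
    discarding the trailing remainder."""
    return [token_ids[i:i + seqlen]
            for i in range(0, len(token_ids) - seqlen + 1, seqlen)]
```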

Stage 2: AutoRound W4A16 Quantization (30 GB → 8.7 GB)

AutoRound from Intel uses signed gradient descent to find optimal rounding decisions, iteratively adjusting per-weight rounding directions to minimize output error across calibration samples.
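A minimal illustration of the idea (a toy layer, straight-through rounding, and plain SignSGD; AutoRound's actual recipe differs in its details):

```python
import torch

torch.manual_seed(0)
w = torch.randn(64, 64)                      # toy layer weights
x = torch.randn(256, 64)                     # stand-in calibration inputs
scale = w.abs().max() / 7                    # symmetric INT4 levels -8..7
v = torch.zeros_like(w, requires_grad=True)  # learnable rounding offsets

def ste_round(t):
    # Straight-through estimator: round in forward, identity in backward.
    return (t.round() - t).detach() + t

def qdq(w, v):
    # Quantize-dequantize with the current rounding offsets.
    return torch.clamp(ste_round(w / scale + v), -8, 7) * scale

ref = x @ w.T                                # full-precision reference output
for _ in range(200):
    loss = ((x @ qdq(w, v).T) - ref).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        v -= 0.005 * v.grad.sign()           # signed gradient step
        v.clamp_(-0.5, 0.5)                  # keep offsets within one rounding step
        v.grad.zero_()
```

Each offset in `v` nudges its weight toward rounding up or down, steered only by the sign of the gradient of the layer's output error.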

| Property | Value |
|---|---|
| Quantization tool | Intel AutoRound v0.10.2 |
| Bits | 4 (INT4, symmetric) |
| Group size | 128 |
| Format | auto_round (GPTQ-compatible packing) |
| Calibration dataset | NeelNanda/pile-10k |
| Calibration samples | 64 |
| Sequence length | 512 |
| Quantized layers | 196/197 per block (9,408/9,456 total) |
| Skipped | mlp.gate (MoE router), kept at FP16 to preserve routing precision |
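The router exclusion amounts to a name filter over the model's modules; a hypothetical sketch (the module names follow the usual Qwen3 MoE layout, which is an assumption here):

```python
# Keep the MoE router (mlp.gate) in FP16; quantize everything else.
def should_quantize(module_name: str) -> bool:
    return not module_name.endswith("mlp.gate")

names = [
    "model.layers.0.mlp.gate",                 # MoE router: kept at FP16
    "model.layers.0.mlp.experts.0.gate_proj",  # expert gate_proj: quantized
    "model.layers.0.self_attn.q_proj",         # attention: quantized
]
quantized = [n for n in names if should_quantize(n)]
```

Note that expert `gate_proj` weights are unaffected: the filter matches only the router module itself, not every name containing "gate".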

How to Use

With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "atbender/Qwen3-REAP-15B-A3B-W4A16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```

With vLLM

```shell
vllm serve atbender/Qwen3-REAP-15B-A3B-W4A16 \
  --quantization gptq \
  --trust-remote-code
```

How It Was Made

The Fun Part

```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B",
    trust_remote_code=True,
)

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
)
ar.quantize_and_save("./Qwen3-REAP-15B-A3B-W4A16", format="auto_round")
```

The Less Fun Part (Patches for Forward Compatibility)

These patches are needed for transformers 5.x; they are no-ops on 4.55 but are included for reproducibility:

```python
import types, torch, transformers

# Patch 1: Conv1D removed in transformers 5.x
if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils

# Patch 2: Qwen can be misdetected as multimodal
import auto_round.utils
import auto_round.autoround

auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

# Both patches must come BEFORE: from auto_round import AutoRound
```

Hardware & Runtime

| Stage | Time | Hardware |
|---|---|---|
| REAP observation | ~74 min | 2× RTX A6000 48GB |
| REAP pruning | <1 sec | CPU |
| AutoRound quantization (48 blocks) | ~46 min | 2× RTX A6000 48GB |
| Total pipeline | ~2 hours | |

| Resource | Peak |
|---|---|
| VRAM | 14.94 GB |
| RAM | 33.56 GB |

Intended Use

  • Research on MoE pruning and compression techniques
  • Practice run / reference for larger MoE compression (e.g., Step-3.5-Flash)
  • Exploring sparsityโ€“performance trade-offs in MoE architectures
  • Local / single-GPU deployment of a compressed Qwen3 MoE variant

Limitations

  • No post-pruning fine-tuning: this is a raw prune + quantize, so quality degradation is expected
  • Aggressive compression: 50% expert removal plus 4-bit quantization is a significant reduction
  • Calibration bias: REAP calibration was 70% code-focused, while quantization used general text (pile-10k)
  • Not benchmarked: no formal evals have been run yet; contributions welcome

Acknowledgements

  • Qwen team: Qwen3-30B-A3B base model
  • Cerebras Research: REAP method
  • Intel: AutoRound quantization framework
  • OpenMOSE: reference implementation and model card inspiration

Citation

```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
```

License

Apache License 2.0, same as the base Qwen3-30B-A3B model.
