# Qwen3-REAP-15B-A3B-W4A16
INT4 weight-quantized version of Qwen3-REAP-15B-A3B, created by applying REAP pruning + AutoRound W4A16 quantization to Qwen3-30B-A3B.
30B → 15B (REAP 50% prune) → 8.7 GB (W4A16) | ~3B active per token
## Model Summary
| | Original | REAP Pruned | Quantized (this model) |
|---|---|---|---|
| Model | Qwen/Qwen3-30B-A3B | atbender/Qwen3-REAP-15B-A3B | Qwen3-REAP-15B-A3B-W4A16 |
| Total Parameters | ~30B | ~15B | ~15B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 128 | 64 | 64 |
| Experts Routed per Token | 8 | 8 | 8 |
| Hidden Layers | 48 | 48 | 48 |
| Hidden Size | 2048 | 2048 | 2048 |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, FP16 activations) |
| Disk Size | ~57 GB | ~30 GB | ~8.7 GB |
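The ~8.7 GB disk size above can be sanity-checked with back-of-the-envelope arithmetic. A rough sketch (it ignores the unquantized router/embedding layers and packing metadata, which account for the remaining ~1 GB):

```python
# Rough size estimate: ~15B parameters at 4 bits each,
# plus one FP16 scale per group of 128 weights.
params = 15e9
weight_bytes = params * 4 / 8           # 4 bits per weight
scale_bytes = (params / 128) * 2        # FP16 scale per 128-weight group
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB")            # → ~7.7 GB before FP16 layers and metadata
```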
## Compression Pipeline

### Stage 1: REAP Expert Pruning (30B → 15B)
REAP (Router-weighted Expert Activation Pruning) from Cerebras Research combines router gate statistics with expert activation norms to decide which experts to prune. Here, 50% of MoE experts were pruned globally (128 → 64 per layer, across all 48 layers).
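The scoring idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the Cerebras implementation; the function and variable names here are made up:

```python
import numpy as np

def reap_saliency(gate_weights, expert_norms):
    """Toy REAP-style score: average router gate weight times expert
    output norm, over the tokens actually routed to each expert.

    gate_weights: (tokens, experts) routing weights (0 where not routed)
    expert_norms: (tokens, experts) L2 norms of each expert's output
    """
    routed = gate_weights > 0
    contrib = gate_weights * expert_norms
    counts = np.maximum(routed.sum(axis=0), 1)  # avoid divide-by-zero
    return contrib.sum(axis=0) / counts

rng = np.random.default_rng(0)
gates = rng.random((1000, 8)) * (rng.random((1000, 8)) > 0.5)
norms = rng.random((1000, 8)) * 10
scores = reap_saliency(gates, norms)
keep = np.argsort(scores)[len(scores) // 2:]  # keep the top 50% of experts
```

The real method prunes by these saliency scores globally at a fixed compression ratio, which is what the 50% figure above corresponds to.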
REAP calibration data – 1,000 samples, packed to 2,048-token sequences:
| Source | Proportion | Dataset | Description |
|---|---|---|---|
| Agentic trajectories | 40% | togethercomputer/CoderForge-Preview | Passing SWE-agent trajectories |
| Raw code | 30% | bigcode/the-stack-smol (Python) | Python source code |
| General web text | 10% | allenai/c4 (English) | Pretraining distribution proxy |
| Broad coverage | 20% | NeelNanda/pile-10k | Mixed general text |
Pruning configuration:

```yaml
prune_method: reap
compression_ratio: 0.5
seed: 42
distance_measure: cosine
samples_per_category: 1024
model_max_length: 2048
```
### Stage 2: AutoRound W4A16 Quantization (30 GB → 8.7 GB)
AutoRound from Intel uses signed gradient descent to find optimal rounding decisions, iteratively adjusting rounding directions to minimize output error across calibration samples.
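A toy sketch of the signed-gradient rounding idea, for intuition only. The real AutoRound optimizes per transformer block with straight-through gradients and also tunes clipping ranges; everything below is simplified and the names are illustrative:

```python
import numpy as np

def signed_grad_round(w, x, scale, steps=200, lr=0.005):
    """Learn a per-weight perturbation v in [-0.5, 0.5] so that
    round(w/scale + v) minimizes output error on calibration data x.
    Gradients flow through rounding via a straight-through estimate."""
    v = np.zeros_like(w)
    y_ref = x @ w
    for _ in range(steps):
        q = np.clip(np.round(w / scale + v), -8, 7)   # INT4 range
        err = x @ (q * scale) - y_ref
        grad = x.T @ err              # grad of 0.5*||err||^2 w.r.t. dequant weight
        v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)  # signed-gradient step
    return np.clip(np.round(w / scale + v), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4))
x = rng.normal(size=(64, 16))             # stand-in calibration activations
scale = np.abs(w).max() / 7
tuned = signed_grad_round(w, x, scale)
```

The key point is that rounding direction is chosen to minimize *output* error on calibration data, not per-weight rounding error as in naive round-to-nearest.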
| Property | Value |
|---|---|
| Quantization tool | Intel AutoRound v0.10.2 |
| Bits | 4 (INT4 symmetric) |
| Group size | 128 |
| Format | auto_round (GPTQ-compatible packing) |
| Calibration dataset | NeelNanda/pile-10k |
| Calibration samples | 64 |
| Sequence length | 512 |
| Quantized layers | 196/197 per block (9,408/9,456 total) |
| Skipped | mlp.gate (MoE router) – kept at FP16 to preserve routing precision |
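The W4A16 scheme in the table (symmetric INT4 weights, group size 128, FP16 scales) can be illustrated with a minimal NumPy quantize/dequantize round trip. This is a sketch of the numerics, not the actual `auto_round` packing format:

```python
import numpy as np

def quantize_w4a16(w, group_size=128):
    """Group-wise symmetric INT4 quantization: one FP16 scale per
    group of 128 weights along the input dimension."""
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(w_g).max(axis=-1, keepdims=True) / 7  # map max|w| to 7
    q = np.clip(np.round(w_g / scales), -8, 7).astype(np.int8)
    return q.reshape(out_f, in_f), scales.astype(np.float16)

def dequantize(q, scales, group_size=128):
    out_f, in_f = q.shape
    q_g = q.reshape(out_f, in_f // group_size, group_size)
    return (q_g * scales.astype(np.float32)).reshape(out_f, in_f)

w = np.random.default_rng(0).normal(size=(8, 256)).astype(np.float32)
q, s = quantize_w4a16(w)
w_hat = dequantize(q, s)
# Per-element quantization error is bounded by roughly half a scale step.
```

At inference time the INT4 weights are dequantized to FP16 and multiplied against FP16 activations, which is what the "A16" half of W4A16 refers to.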
## How to Use
### With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "atbender/Qwen3-REAP-15B-A3B-W4A16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
### With vLLM

```shell
vllm serve atbender/Qwen3-REAP-15B-A3B-W4A16 \
  --quantization gptq \
  --trust-remote-code
```
## How It Was Made

### The Fun Part
```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B",
    trust_remote_code=True,
)

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
)
ar.quantize_and_save("./Qwen3-REAP-15B-A3B-W4A16", format="auto_round")
```
### The Less Fun Part (Patches for Forward Compatibility)

These patches are needed for transformers 5.x – they are no-ops on 4.55 but included for reproducibility:
```python
import types, torch, transformers

# Patch 1: Conv1D removed in transformers 5.x
if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils

# Patch 2: Qwen can be misdetected as multimodal
import auto_round.utils
import auto_round.autoround
auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

# Both patches must come BEFORE: from auto_round import AutoRound
```
## Hardware & Runtime
| Stage | Time | Hardware |
|---|---|---|
| REAP observation | ~74 min | 2× RTX A6000 48GB |
| REAP pruning | <1 sec | CPU |
| AutoRound quantization (48 blocks) | ~46 min | 2× RTX A6000 48GB |
| Total pipeline | ~2 hours | |
| Resource | Peak |
|---|---|
| VRAM | 14.94 GB |
| RAM | 33.56 GB |
## Intended Use

- Research on MoE pruning and compression techniques
- Practice run / reference for larger MoE compression (e.g., Step-3.5-Flash)
- Exploring sparsity–performance trade-offs in MoE architectures
- Local / single-GPU deployment of a compressed Qwen3 MoE variant
## Limitations

- No post-pruning fine-tuning – raw prune + quantize; some quality degradation is expected
- Aggressive compression – 50% expert removal plus 4-bit quantization is significant
- Calibration bias – REAP calibration was 70% code-focused, while quantization used general text (pile-10k)
- Not benchmarked – no formal evals run yet; contributions welcome
## Acknowledgements

- Qwen team – Qwen3-30B-A3B base model
- Cerebras Research – REAP method
- Intel – AutoRound quantization framework
- OpenMOSE – reference implementation and model card inspiration
## Citation

```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
```
## License

Apache License 2.0 – same as the base Qwen3-30B-A3B model.