# mlx-community/Qwen3.5-REAP-97B-A10B-4bit
This model is a 4-bit quantized version of OpenMOSE/Qwen3.5-REAP-97B-A10B, converted to MLX format using mlx-lm.
## Model Details
| Property | Value |
|---|---|
| Original model | OpenMOSE/Qwen3.5-REAP-97B-A10B |
| Base model | Qwen/Qwen3.5-122B-A10B |
| Architecture | qwen3_5_moe (Decoder-only Transformer, hybrid linear/full attention + MoE) |
| Total parameters | ~97B (pruned from 122B via REAP) |
| Active parameters | ~10B per token (8 experts per token) |
| MoE experts | 200 (pruned from 256) |
| Num layers | 48 |
| Hidden size | 3072 |
| Attention | Hybrid (3x linear + 1x full, repeating every 4 layers) |
| Max context length | 262,144 tokens |
| Quantization | 4-bit (affine, group size 64), gates kept at 8-bit |
| Model size on disk | ~51 GB |
| License | Apache 2.0 |
## About the Original Model
Qwen3.5-REAP-97B-A10B was created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3.5-122B-A10B. REAP uses router statistics and expert activation patterns to identify and prune under-used or redundant MoE experts. Roughly 22% of experts were pruned (256 to 200), reducing total parameters from 122B to 97B while maintaining the same ~10B active parameters per token.
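REAP scores each expert by its router-weighted activation magnitude over a calibration set and prunes the lowest-scoring experts. The following NumPy sketch illustrates that scoring idea on synthetic data; the variable names and the exact saliency formula are simplified assumptions, not the official implementation:

```python
import numpy as np

def reap_saliency(gate_probs, expert_out_norms):
    """REAP-style saliency: average (router weight x expert output
    magnitude) over a calibration set. Lowest-scoring experts are
    pruned. Illustrative sketch only.

    gate_probs:       (tokens, experts) router probabilities
    expert_out_norms: (tokens, experts) L2 norms of each expert's output
    """
    return (gate_probs * expert_out_norms).mean(axis=0)

rng = np.random.default_rng(0)
tokens, experts = 1024, 8
gates = rng.dirichlet(np.ones(experts), size=tokens)  # rows sum to 1
norms = rng.uniform(0.5, 2.0, size=(tokens, experts))
scores = reap_saliency(gates, norms)

# Keep the 6 highest-scoring experts (~25% pruned, roughly the
# same ratio as 256 -> 200 in the real model)
keep = np.argsort(scores)[-6:]
print(sorted(keep.tolist()))
```

Because active parameters per token depend only on the top-k routing (8 experts per token), pruning whole experts shrinks total parameters without changing per-token compute.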
This is an unofficial community variant, not affiliated with or endorsed by Alibaba or Cerebras Systems.
## How to Use
### Install

```shell
pip install mlx-lm
```
### Generate

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-REAP-97B-A10B-4bit")

prompt = "Explain the theory of relativity in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=text, max_tokens=512)
print(response)
```
### CLI

```shell
mlx_lm.generate \
  --model mlx-community/Qwen3.5-REAP-97B-A10B-4bit \
  --prompt "What is 2+2?" \
  --max-tokens 256
```
## Memory Requirements
At 4-bit quantization, this model requires approximately 54 GB of unified memory to load. It runs on Apple Silicon Macs with 64 GB or more of RAM (96 GB+ recommended for comfortable context lengths).
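The figures above can be sanity-checked with back-of-envelope arithmetic, assuming each group of 64 weights stores an fp16 scale and an fp16 bias alongside the 4-bit values (the 8-bit gates and runtime KV cache add more on top):

```python
params = 97e9

# 4-bit weights plus per-group fp16 scale and bias (group size 64):
# (16 + 16) extra bits amortized over 64 weights = 0.5 bits/weight
bits_per_weight = 4 + (16 + 16) / 64  # 4.5 effective bits

weight_bytes = params * bits_per_weight / 8
print(f"{weight_bytes / 1e9:.1f} GB")       # 54.6 GB (decimal)
print(f"{weight_bytes / 1024**3:.1f} GiB")  # 50.8 GiB (binary)
```

The decimal figure lines up with the ~54 GB load requirement, and the binary figure with the ~51 GB on-disk size.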
## Conversion Details
- Converted using `mlx_lm.convert` with `-q --q-bits 4`
- Default affine quantization with group size 64
- MoE gate/router weights kept at 8-bit precision for routing accuracy
- Vision tower weights are excluded (text-only)
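Per-group affine quantization, as named above, can be sketched in a few lines of NumPy. This is illustrative only; MLX's actual kernels pack values and choose parameters differently:

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Per-group affine quantization (sketch). Each group of 64
    weights gets its own scale and offset; values map to [0, 15]."""
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(1)
w = rng.normal(size=4096).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo).reshape(-1)

# Round-trip error is bounded by half a quantization step per group
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

Keeping the MoE gate/router weights at 8-bit halves their quantization step, which matters because small routing errors change which experts fire at all.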
## Limitations
This model inherits all limitations of the original:
- Can hallucinate plausible but incorrect information
- Biases from training data remain
- Some long-tail domain knowledge may have degraded from expert pruning
- Vision-language capabilities are not included in this MLX conversion
- Not suitable for medical, legal, or financial advice, or for safety-critical decisions, without additional safeguards
## Acknowledgements
- Qwen team for the Qwen3.5 model family
- OpenMOSE for REAP pruning and the 97B variant
- Cerebras Research for the REAP method
- Apple MLX team for the MLX framework