# mlx-community/Qwen3.5-REAP-97B-A10B-4bit
This model is a 4-bit quantized version of OpenMOSE/Qwen3.5-REAP-97B-A10B, converted to MLX format using mlx-lm.
## Model Details
| Property | Value |
|---|---|
| Original model | OpenMOSE/Qwen3.5-REAP-97B-A10B |
| Base model | Qwen/Qwen3.5-122B-A10B |
| Architecture | qwen3_5_moe (Decoder-only Transformer, hybrid linear/full attention + MoE) |
| Total parameters | ~97B (pruned from 122B via REAP) |
| Active parameters | ~10B per token (8 experts per token) |
| MoE experts | 200 (pruned from 256) |
| Num layers | 48 |
| Hidden size | 3072 |
| Attention | Hybrid (3x linear + 1x full, repeating every 4 layers) |
| Max context length | 262,144 tokens |
| Quantization | 4-bit (affine, group size 64), gates kept at 8-bit |
| Model size on disk | ~51 GB |
| License | Apache 2.0 |
## About the Original Model
Qwen3.5-REAP-97B-A10B was created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3.5-122B-A10B. REAP uses router statistics and expert activation patterns to identify and prune under-used or redundant MoE experts. Roughly 22% of experts were pruned (256 to 200), reducing total parameters from 122B to 97B while maintaining the same ~10B active parameters per token.
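REAP scores each expert by its router-weighted activation magnitude over a calibration set and prunes the lowest-scoring experts. The following NumPy sketch illustrates that scoring idea on synthetic data; the variable names and the exact saliency formula are simplified assumptions, not the official implementation:

```python
import numpy as np

def reap_saliency(gate_probs, expert_out_norms):
    """REAP-style saliency: average (router weight x expert output
    magnitude) over a calibration set. Lowest-scoring experts are
    pruned. Illustrative sketch only.

    gate_probs:       (tokens, experts) router probabilities
    expert_out_norms: (tokens, experts) L2 norms of each expert's output
    """
    return (gate_probs * expert_out_norms).mean(axis=0)

rng = np.random.default_rng(0)
tokens, experts = 1024, 8
gates = rng.dirichlet(np.ones(experts), size=tokens)  # rows sum to 1
norms = rng.uniform(0.5, 2.0, size=(tokens, experts))
scores = reap_saliency(gates, norms)

# Keep the 6 highest-scoring experts (~25% pruned, roughly the
# same ratio as 256 -> 200 in the real model)
keep = np.argsort(scores)[-6:]
print(sorted(keep.tolist()))
```

Because active parameters per token depend only on the top-k routing (8 experts per token), pruning whole experts shrinks total parameters without changing per-token compute.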
This is an unofficial community variant, not affiliated with or endorsed by Alibaba or Cerebras Systems.
## How to Use
### Install

```shell
pip install mlx-lm
```
### Generate

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-REAP-97B-A10B-4bit")

prompt = "Explain the theory of relativity in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=text, max_tokens=512)
print(response)
```
### CLI

```shell
mlx_lm.generate \
  --model mlx-community/Qwen3.5-REAP-97B-A10B-4bit \
  --prompt "What is 2+2?" \
  --max-tokens 256
```
## Memory Requirements
At 4-bit quantization, this model requires approximately 54 GB of unified memory to load. It runs on Apple Silicon Macs with 64 GB or more of RAM (96 GB+ recommended for comfortable context lengths).
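The figures above can be sanity-checked with back-of-envelope arithmetic, assuming each group of 64 weights stores an fp16 scale and an fp16 bias alongside the 4-bit values (the 8-bit gates and runtime KV cache add more on top):

```python
params = 97e9

# 4-bit weights plus per-group fp16 scale and bias (group size 64):
# (16 + 16) extra bits amortized over 64 weights = 0.5 bits/weight
bits_per_weight = 4 + (16 + 16) / 64  # 4.5 effective bits

weight_bytes = params * bits_per_weight / 8
print(f"{weight_bytes / 1e9:.1f} GB")       # 54.6 GB (decimal)
print(f"{weight_bytes / 1024**3:.1f} GiB")  # 50.8 GiB (binary)
```

The decimal figure lines up with the ~54 GB load requirement, and the binary figure with the ~51 GB on-disk size.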
## Conversion Details
- Converted using `mlx_lm.convert` with `-q --q-bits 4`
- Default affine quantization with group size 64
- MoE gate/router weights kept at 8-bit precision for routing accuracy
- Vision tower weights are excluded (text-only)
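Per-group affine quantization, as named above, can be sketched in a few lines of NumPy. This is illustrative only; MLX's actual kernels pack values and choose parameters differently:

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Per-group affine quantization (sketch). Each group of 64
    weights gets its own scale and offset; values map to [0, 15]."""
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(1)
w = rng.normal(size=4096).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo).reshape(-1)

# Round-trip error is bounded by half a quantization step per group
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

Keeping the MoE gate/router weights at 8-bit halves their quantization step, which matters because small routing errors change which experts fire at all.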
## Limitations
This model inherits all limitations of the original:
- Can hallucinate plausible but incorrect information
- Biases from training data remain
- Some long-tail domain knowledge may have degraded from expert pruning
- Vision-language capabilities are not included in this MLX conversion
- Not suitable for medical, legal, or financial advice, or for safety-critical decisions, without additional safeguards
## Acknowledgements
- Qwen team for the Qwen3.5 model family
- OpenMOSE for REAP pruning and the 97B variant
- Cerebras Research for the REAP method
- Apple MLX team for the MLX framework