mlx-community/Qwen3.5-REAP-97B-A10B-4bit

This model is a 4-bit quantized version of OpenMOSE/Qwen3.5-REAP-97B-A10B, converted to MLX format using mlx-lm.

Model Details

| Property | Value |
|----------|-------|
| Original model | OpenMOSE/Qwen3.5-REAP-97B-A10B |
| Base model | Qwen/Qwen3.5-122B-A10B |
| Architecture | qwen3_5_moe (decoder-only Transformer, hybrid linear/full attention + MoE) |
| Total parameters | ~97B (pruned from 122B via REAP) |
| Active parameters | ~10B per token (8 experts per token) |
| MoE experts | 200 (pruned from 256) |
| Num layers | 48 |
| Hidden size | 3072 |
| Attention | Hybrid (3x linear + 1x full, repeating every 4 layers) |
| Max context length | 262,144 tokens |
| Quantization | 4-bit (affine, group size 64); gates kept at 8-bit |
| Model size on disk | ~51 GB |
| License | Apache 2.0 |
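The hybrid attention layout in the table (3x linear + 1x full, repeating every 4 layers, over 48 layers) can be sketched as follows. The exact phase of the pattern — which layer within each group of four uses full attention — is an assumption for illustration:

```python
# Sketch of the hybrid attention layout: with 48 layers and a repeating
# (3x linear + 1x full) pattern, every 4th layer uses full attention.
# Which slot in each group of 4 is "full" is assumed here, not confirmed.
def attention_kind(layer_idx: int) -> str:
    """Return 'full' for every 4th layer (0-indexed: layers 3, 7, 11, ...)."""
    return "full" if layer_idx % 4 == 3 else "linear"

kinds = [attention_kind(i) for i in range(48)]
print(kinds.count("full"), kinds.count("linear"))  # 12 36
```

So only 12 of the 48 layers pay the quadratic cost of full attention, which is what makes the 262K-token context practical.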

About the Original Model

Qwen3.5-REAP-97B-A10B was created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3.5-122B-A10B. REAP uses router statistics and expert activation patterns to identify and prune under-used or redundant MoE experts. Roughly 22% of experts were pruned (256 to 200), reducing total parameters from 122B to 97B while maintaining the same ~10B active parameters per token.
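The pruning idea described above can be sketched with a toy saliency score: rank each expert by the mean router weight it receives times the mean norm of its outputs over a calibration set, then drop the lowest-ranked experts. This is a simplified illustration of router-weighted pruning, not the exact REAP criterion; all array shapes and the 8-expert toy setup are invented for the example:

```python
import numpy as np

# Toy sketch: score each expert by (router weight x output norm), averaged
# over calibration tokens, and keep the highest-scoring experts.
rng = np.random.default_rng(0)
n_experts, n_tokens, keep = 8, 1000, 6

router_weights = rng.random((n_tokens, n_experts))       # gate values per token
router_weights /= router_weights.sum(axis=1, keepdims=True)
output_norms = rng.random((n_tokens, n_experts)) + 0.5   # ||expert output|| per token

saliency = (router_weights * output_norms).mean(axis=0)
kept = np.sort(np.argsort(saliency)[-keep:])             # drop lowest-saliency experts
print("pruned experts:", sorted(set(range(n_experts)) - set(kept.tolist())))
```

In the real model the same idea is applied per MoE layer, reducing 256 experts to 200 while the router still activates 8 experts per token.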

This is an unofficial community variant, not affiliated with or endorsed by Alibaba or Cerebras Systems.

How to Use

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-REAP-97B-A10B-4bit")

prompt = "Explain the theory of relativity in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=text, max_tokens=512)
print(response)

CLI

mlx_lm.generate \
  --model mlx-community/Qwen3.5-REAP-97B-A10B-4bit \
  --prompt "What is 2+2?" \
  --max-tokens 256

Memory Requirements

At 4-bit quantization, this model requires approximately 54 GB of unified memory to load. It runs on Apple Silicon Macs with 64 GB or more of RAM (96 GB+ recommended for comfortable context lengths).
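The ~54 GB figure is consistent with a back-of-the-envelope estimate: 4-bit weight values plus the per-group quantization metadata. Assuming an fp16 scale and bias per group of 64 weights (the storage layout is an assumption for this estimate):

```python
# Back-of-the-envelope weight-memory estimate for 4-bit, group-size-64
# affine quantization: each group of 64 weights adds a scale and a bias
# (assumed fp16), i.e. ~0.5 extra bits per weight on average.
params = 97e9
bits_per_weight = 4 + (16 + 16) / 64   # 4-bit codes + fp16 scale/bias per group
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"{weight_gb:.1f} GB")           # ~54.6 GB
```

Activations, the KV cache, and runtime overhead come on top of this, which is why 64 GB of unified memory is the practical floor.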

Conversion Details

  • Converted using mlx_lm.convert with -q --q-bits 4
  • Default affine quantization with group size 64
  • MoE gate/router weights kept at 8-bit precision to preserve routing accuracy
  • Vision tower weights are excluded (text-only conversion)

Limitations

This model inherits all limitations of the original:

  • Can hallucinate plausible but incorrect information
  • Biases from training data remain
  • Some long-tail domain knowledge may have degraded from expert pruning
  • Vision-language capabilities are not included in this MLX conversion
  • Not suitable for medical, legal, or financial advice, or for safety-critical decisions, without additional safeguards
