# MiniMax-M2.5-REAP-139B-A10B-MLX-3.7bit
> 🚧 **Experimental — Quality Degraded.** This is an experimental quantization with known quality issues. The aggressive 3-bit compression causes significant performance loss on reasoning tasks and corrupts emoji/special characters. Released for research purposes only; not recommended for production use.
Mixed-precision MLX quantization of MiniMax-M2.5-REAP-139B-A10B — Cerebras' REAP-pruned MiniMax-M2.5 with 154 experts (pruned from 256), optimized for Apple Silicon local inference.
- 3.676 BPW | 60 GB | 139B total / 10B active per token
## ⚠️ Known Limitations
This quantization is released as an experimental / research artifact. The aggressive 3-bit compression of the expert FFN layers (93.9% of total parameters) results in noticeable quality degradation in complex reasoning scenarios. Users should be aware of the following:
### Observed Quality Issues
Based on a standardized 10-question benchmark covering debugging, algorithms, security, concurrency, system design, and logic reasoning:
| Issue | Severity | Details |
|---|---|---|
| Logic reasoning errors | 🔴 Critical | Enumeration table contains calculation mistakes; final answers contradict self-verification steps |
| JS event loop misunderstanding | 🔴 Critical | Incorrect microtask/macrotask ordering — placed 7 after 4 instead of before |
| System design confusion | 🟡 Moderate | Answered "sliding window" question with a fixed-window INCR + EXPIRE scheme |
| Bug sensitivity | 🟡 Moderate | Initially denied the existence of a classic integer overflow bug, then hedged |
| Security classification | 🟠 Minor | Labeled SSTI (Server-Side Template Injection, enables RCE) as merely "XSS" |
| Emoji / special character corruption | 🟡 Moderate | Emoji output is garbled or replaced with broken Unicode sequences, likely due to embedding layer quantization artifacts |
### Benchmark Score
| Model | Score | Grade |
|---|---|---|
| Claude Opus 4.6 (reference) | 98/100 | S |
| Qwen3.5-122B v2 (3.7bit) | 95/100 | S |
| Qwen3.5-122B v1 (3.7bit) | 90/100 | A+ |
| MiniMax-REAP (this model) | 71/100 | B |
The quality issues are concentrated in high-difficulty questions requiring deep understanding and rigorous reasoning, suggesting that MiniMax-M2.5's expert FFN layers are particularly sensitive to sub-4-bit quantization. The model remains capable on straightforward tasks: algorithms, code transformation, and basic Q&A all scored full marks.
### Research Value
Despite the quality limitations, this quantization may have some research value:
- Feasibility demonstration: Proves that 139B MoE models can physically fit on 96GB Apple Silicon devices
- Sensitivity analysis: Reveals that MiniMax's architecture is more quantization-sensitive than Qwen3.5's, particularly in the expert FFN layers
- Architecture comparison: While Qwen3.5-122B at 3.7bit retains 95/100 reasoning capability, MiniMax-139B at the same bit-rate drops to 71/100 — suggesting that quantization resilience varies significantly across architectures
- Baseline for future work: Serves as a reference point for evaluating improved quantization techniques (GPTQ, AWQ, or calibration-based methods) on this architecture
## 📊 Benchmark (M2 Max 96GB)
While the reasoning quality is significantly impacted, the inference speed remains impressive for a 139B class model on local hardware:
| Context | Prefill (PP) | Generation (TG) |
|---|---|---|
| 1k | 160.0 tok/s | 27.9 tok/s |
| 4k | 166.5 tok/s | 22.0 tok/s |
Tested on M2 Max (38c GPU) with 96GB Unified Memory.
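As a rough planning aid, the prefill/generation rates above translate into end-to-end latency as follows. This is a back-of-the-envelope sketch; real timings vary with cache state, sampler, and prompt structure:

```python
def estimate_latency(prompt_tokens, gen_tokens, pp_tok_s, tg_tok_s):
    """Rough end-to-end latency: prefill time plus generation time."""
    return prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s

# 4k-context figures from the table above: 166.5 tok/s prefill, 22.0 tok/s generation
t = estimate_latency(4096, 500, 166.5, 22.0)
print(f"{t:.1f} s")  # ~47 s for a 4k prompt plus 500 generated tokens
```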
## 🚀 Hardware Requirements
The original FP8 model weighs 131GB, and BF16 would be an impractical 260GB. Through mixed-precision quantization, we've compressed it to 60GB:
- 96GB Unified Memory (Minimum): Fits the full model with moderate headroom for KV cache, enabling 139B-parameter local inference on a single Mac Studio or MacBook Pro.
- 128GB+ Unified Memory (Recommended): Delivers comfortable headroom for the KV cache, supporting longer context windows.
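The headroom claims can be sanity-checked against the architecture numbers listed further down (62 layers, 8 KV heads, head_dim 128). A sketch assuming FP16 KV cache entries:

```python
def kv_cache_gib(tokens, layers=62, kv_heads=8, head_dim=128, bytes_per=2):
    """FP16 KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # bytes
    return tokens * per_token / 2**30

print(f"{kv_cache_gib(8_192):.1f} GiB")    # ~1.9 GiB at 8k context
print(f"{kv_cache_gib(196_608):.1f} GiB")  # ~46.5 GiB at the full 196k context
```

At full context the KV cache alone approaches 47 GiB on top of the ~60 GB of weights, which is why 96GB is workable only at moderate context lengths and 128GB+ is recommended for long ones.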
## Quantization
4-tier mixed precision by functional sensitivity:
| Bits | Modules | % Params | Rationale |
|---|---|---|---|
| BF16 | Norm layers, router gate, routing bias | <0.1% | Preserves routing stability |
| 8-bit | Embeddings, lm_head, all q/k/v/o_proj | ~2.9% | Model entry/exit points and full attention |
| 4-bit | Layer 0 & 61 expert FFN | ~3.1% | Entry/exit layer experts are the most quantization-sensitive |
| 3-bit | Expert FFN w1/w2/w3, 154 experts × 60 middle layers | ~93.9% | Top-8 experts active per token (~5.2% of experts) |
### Optimization Philosophy
Three targeted upgrades over baseline 3-bit quantization, costing only ~1.4GB total:
- embed/lm_head → 8-bit (+0.3GB): Every token passes through these layers. Quantization noise here corrupts all downstream computation (embed) or directly distorts output probabilities (lm_head).
- All attention → 8-bit (+0.6GB): Attention is only 2% of parameters. Going from 3→8bit costs a trivial 0.6GB but provides near-lossless attention quality, effectively preventing repetition and degradation.
- Entry/exit layer experts → 4-bit (+0.5GB): Layer 0's quantization errors propagate through all 61 subsequent layers. Layer 61's output feeds directly to lm_head with no opportunity for error correction.
## Model Architecture
| Parameter | Value |
|---|---|
| Architecture | Transformer MoE (MiniMaxM2ForCausalLM) |
| Total Parameters | 139.1B |
| Active Parameters | ~10B per token |
| Layers | 62 |
| Hidden Size | 3,072 |
| Attention | 48 heads GQA (8 KV heads), head_dim=128 |
| QK Norm | Per-layer RMSNorm |
| Experts | 154 per layer (top-8 activated, sigmoid router) |
| Expert FFN | 3,072 → 1,536 → 3,072 (SwiGLU) |
| Context Length | 196,608 tokens |
| Vocabulary | 200,064 tokens |
| RoPE | θ = 5,000,000, partial rotary (dim=64) |
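The total/active parameter counts can be cross-checked from the table. A sketch that ignores norms, router, and bias terms, and assumes untied embeddings (only the lm_head matmul is counted as active, since the input embedding is a lookup):

```python
hidden, ffn, layers = 3072, 1536, 62
heads, kv_heads, head_dim = 48, 8, 128
experts, topk, vocab = 154, 8, 200_064

# per-layer attention: q/o projections plus the smaller GQA k/v projections
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
expert = 3 * hidden * ffn              # w1/w2/w3 SwiGLU matrices
embed = 2 * vocab * hidden             # input embeddings + lm_head

total = layers * (attn + experts * expert) + embed
active = layers * (attn + topk * expert) + vocab * hidden

print(f"total ≈ {total/1e9:.1f}B, active ≈ {active/1e9:.1f}B")
```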
## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/MiniMax-M2.5-REAP-139B-A10B-MLX-3.7bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=200)
print(response)
```
## Original Model
This quantization is based on cerebras/MiniMax-M2.5-REAP-139B-A10B, which was created by applying REAP (Router-weighted Expert Activation Pruning) to MiniMaxAI/MiniMax-M2.5 with a 40% pruning rate, reducing experts from 256 to 154 per layer while maintaining near-lossless performance.
## License
Modified MIT (inherited from base model)