# MiniMax-M2.5-REAP-139B-A10B-MLX-3.7bit
> 🚧 **Experimental — Quality Degraded.** This is an experimental quantization with known quality issues. The aggressive 3-bit compression causes significant performance loss on reasoning tasks and corrupts emoji/special characters. Released for research purposes only; not recommended for production use.
Mixed-precision MLX quantization of MiniMax-M2.5-REAP-139B-A10B — Cerebras' REAP-pruned MiniMax-M2.5 with 154 experts (pruned from 256), optimized for Apple Silicon local inference.
- 3.676 BPW | 60 GB | 139B total / 10B active per token
## ⚠️ Known Limitations
This quantization is released as an experimental / research artifact. The aggressive 3-bit compression of the expert FFN layers (93.9% of total parameters) results in noticeable quality degradation in complex reasoning scenarios. Users should be aware of the following:
### Observed Quality Issues
Based on a standardized 10-question benchmark covering debugging, algorithms, security, concurrency, system design, and logic reasoning:
| Issue | Severity | Details |
|---|---|---|
| Logic reasoning errors | 🔴 Critical | Enumeration table contains calculation mistakes; final answers contradict self-verification steps |
| JS event loop misunderstanding | 🔴 Critical | Incorrect microtask/macrotask ordering — placed 7 after 4 instead of before |
| System design confusion | 🟡 Moderate | Answered "sliding window" question with a fixed-window INCR + EXPIRE scheme |
| Bug sensitivity | 🟡 Moderate | Initially denied the existence of a classic integer overflow bug, then hedged |
| Security classification | 🟠 Minor | Labeled SSTI (Server-Side Template Injection, enables RCE) as merely "XSS" |
| Emoji / special character corruption | 🟡 Moderate | Emoji output is garbled or replaced with broken Unicode sequences, likely due to embedding layer quantization artifacts |
### Benchmark Score
| Model | Score | Grade |
|---|---|---|
| Claude Opus 4.6 (reference) | 98/100 | S |
| Qwen3.5-122B v2 (3.7bit) | 95/100 | S |
| Qwen3.5-122B v1 (3.7bit) | 90/100 | A+ |
| MiniMax-REAP (this model) | 71/100 | B |
The quality issues are concentrated in high-difficulty questions requiring deep understanding and rigorous reasoning, suggesting that MiniMax-M2.5's expert FFN layers are particularly sensitive to sub-4-bit quantization. The model remains capable on straightforward tasks: algorithms, code transformation, and basic Q&A all scored full marks.
### Research Value
Despite the quality limitations, this quantization may have some research value:
- Feasibility demonstration: Proves that 139B MoE models can physically fit on 96GB Apple Silicon devices
- Sensitivity analysis: Reveals that MiniMax's architecture is more quantization-sensitive than Qwen3.5's, particularly in the expert FFN layers
- Architecture comparison: While Qwen3.5-122B at 3.7bit retains 95/100 reasoning capability, MiniMax-139B at the same bit-rate drops to 71/100 — suggesting that quantization resilience varies significantly across architectures
- Baseline for future work: Serves as a reference point for evaluating improved quantization techniques (GPTQ, AWQ, or calibration-based methods) on this architecture
## 📊 Benchmark (M2 Max 96GB)
While the reasoning quality is significantly impacted, the inference speed remains impressive for a 139B class model on local hardware:
| Context | Prefill (PP) | Generation (TG) |
|---|---|---|
| 1k | 160.0 tok/s | 27.9 tok/s |
| 4k | 166.5 tok/s | 22.0 tok/s |
Tested on M2 Max (38c GPU) with 96GB Unified Memory.
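As a rough planning aid, the prefill/generation rates above translate into end-to-end latency as follows. This is a back-of-the-envelope sketch; real timings vary with cache state, sampler, and prompt structure:

```python
def estimate_latency(prompt_tokens, gen_tokens, pp_tok_s, tg_tok_s):
    """Rough end-to-end latency: prefill time plus generation time."""
    return prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s

# 4k-context figures from the table above: 166.5 tok/s prefill, 22.0 tok/s generation
t = estimate_latency(4096, 500, 166.5, 22.0)
print(f"{t:.1f} s")  # ~47 s for a 4k prompt plus 500 generated tokens
```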
## 🚀 Hardware Requirements
The original FP8 model weighs 131GB, and BF16 would be an impractical 260GB. Through mixed-precision quantization, we've compressed it to 60GB:
- 96GB Unified Memory (Minimum): Fits the full model with moderate headroom for KV cache, enabling 139B-parameter local inference on a single Mac Studio or MacBook Pro.
- 128GB+ Unified Memory (Recommended): Delivers comfortable headroom for the KV cache, supporting longer context windows.
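The headroom claims can be sanity-checked against the architecture numbers listed further down (62 layers, 8 KV heads, head_dim 128). A sketch assuming FP16 KV cache entries:

```python
def kv_cache_gib(tokens, layers=62, kv_heads=8, head_dim=128, bytes_per=2):
    """FP16 KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # bytes
    return tokens * per_token / 2**30

print(f"{kv_cache_gib(8_192):.1f} GiB")    # ~1.9 GiB at 8k context
print(f"{kv_cache_gib(196_608):.1f} GiB")  # ~46.5 GiB at the full 196k context
```

At full context the KV cache alone approaches 47 GiB on top of the ~60 GB of weights, which is why 96GB is workable only at moderate context lengths and 128GB+ is recommended for long ones.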
## Quantization
4-tier mixed precision by functional sensitivity:
| Bits | Modules | % Params | Rationale |
|---|---|---|---|
| BF16 | Norm layers, router gate, routing bias | <0.1% | Preserves routing stability |
| 8-bit | Embeddings, lm_head, all q/k/v/o_proj | ~2.9% | Model entry/exit points and full attention |
| 4-bit | Layer 0 & 61 expert FFN | ~3.1% | Entry/exit layer experts are the most quantization-sensitive |
| 3-bit | Expert FFN w1/w2/w3, 154 experts × 60 middle layers | ~93.9% | Top-8 experts active per token (~5.2% of experts) |
### Optimization Philosophy
Three targeted upgrades over baseline 3-bit quantization, costing only ~1.4GB total:
- embed/lm_head → 8-bit (+0.3GB): Every token passes through these layers. Quantization noise here corrupts all downstream computation (embed) or directly distorts output probabilities (lm_head).
- All attention → 8-bit (+0.6GB): Attention is only 2% of parameters. Going from 3→8bit costs a trivial 0.6GB but provides near-lossless attention quality, effectively preventing repetition and degradation.
- Entry/exit layer experts → 4-bit (+0.5GB): Layer 0's quantization errors propagate through all 61 subsequent layers. Layer 61's output feeds directly to lm_head with no opportunity for error correction.
## Model Architecture
| Parameter | Value |
|---|---|
| Architecture | Transformer MoE (MiniMaxM2ForCausalLM) |
| Total Parameters | 139.1B |
| Active Parameters | ~10B per token |
| Layers | 62 |
| Hidden Size | 3,072 |
| Attention | 48 heads GQA (8 KV heads), head_dim=128 |
| QK Norm | Per-layer RMSNorm |
| Experts | 154 per layer (top-8 activated, sigmoid router) |
| Expert FFN | 3,072 → 1,536 → 3,072 (SwiGLU) |
| Context Length | 196,608 tokens |
| Vocabulary | 200,064 tokens |
| RoPE | θ = 5,000,000, partial rotary (dim=64) |
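The total/active parameter counts can be cross-checked from the table. A sketch that ignores norms, router, and bias terms, and assumes untied embeddings (only the lm_head matmul is counted as active, since the input embedding is a lookup):

```python
hidden, ffn, layers = 3072, 1536, 62
heads, kv_heads, head_dim = 48, 8, 128
experts, topk, vocab = 154, 8, 200_064

# per-layer attention: q/o projections plus the smaller GQA k/v projections
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
expert = 3 * hidden * ffn              # w1/w2/w3 SwiGLU matrices
embed = 2 * vocab * hidden             # input embeddings + lm_head

total = layers * (attn + experts * expert) + embed
active = layers * (attn + topk * expert) + vocab * hidden

print(f"total ≈ {total/1e9:.1f}B, active ≈ {active/1e9:.1f}B")
```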
## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/MiniMax-M2.5-REAP-139B-A10B-MLX-3.7bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=200)
print(response)
```
## Original Model
This quantization is based on cerebras/MiniMax-M2.5-REAP-139B-A10B, which was created by applying REAP (Router-weighted Expert Activation Pruning) to MiniMaxAI/MiniMax-M2.5 with a 40% pruning rate, reducing experts from 256 to 154 per layer while maintaining near-lossless performance.
## License
Modified MIT (inherited from base model)