MiniMax-M2.5-REAP-139B-A10B-MLX-3.7bit

🚧 Experimental — Quality Degraded. This is an experimental quantization with known quality issues. The aggressive 3-bit compression causes significant performance loss in reasoning tasks and emoji/special character corruption. Released for research purposes only — not recommended for production use.

Mixed-precision MLX quantization of MiniMax-M2.5-REAP-139B-A10B — Cerebras' REAP-pruned MiniMax-M2.5 with 154 experts (pruned from 256), optimized for Apple Silicon local inference.

  • 3.676 BPW | 60 GB | 139B total / 10B active per token
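
The headline figures are internally consistent and easy to sanity-check: 139.1B weights at 3.676 bits per weight works out to roughly 60 GB, assuming the advertised size is in binary gigabytes (GiB). A minimal arithmetic sketch:

```python
# Sanity-check the advertised size: 139.1B params at 3.676 bits/weight.
# Both figures come from this card; treating "60 GB" as GiB is an
# assumption about how the number was rounded.
total_params = 139.1e9
bpw = 3.676

size_bytes = total_params * bpw / 8
size_gib = size_bytes / 1024**3

print(f"{size_gib:.1f} GiB")  # ≈ 59.5 GiB
```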

⚠️ Known Limitations

This quantization is released as an experimental / research artifact. The aggressive 3-bit compression of the expert FFN layers (93.9% of total parameters) results in noticeable quality degradation in complex reasoning scenarios. Users should be aware of the following:

Observed Quality Issues

Based on a standardized 10-question benchmark covering debugging, algorithms, security, concurrency, system design, and logic reasoning:

| Issue | Severity | Details |
|---|---|---|
| Logic reasoning errors | 🔴 Critical | Enumeration table contains calculation mistakes; final answers contradict self-verification steps |
| JS event loop misunderstanding | 🔴 Critical | Incorrect microtask/macrotask ordering — placed 7 after 4 instead of before |
| System design confusion | 🟡 Moderate | Answered "sliding window" question with a fixed-window INCR + EXPIRE scheme |
| Bug sensitivity | 🟡 Moderate | Initially denied the existence of a classic integer overflow bug, then hedged |
| Security classification | 🟠 Minor | Labeled SSTI (Server-Side Template Injection, enables RCE) as merely "XSS" |
| Emoji / special character corruption | 🟡 Moderate | Emoji output is garbled or replaced with broken Unicode sequences, likely due to embedding layer quantization artifacts |

Benchmark Score

| Model | Score | Grade |
|---|---|---|
| Claude Opus 4.6 (reference) | 98/100 | S |
| Qwen3.5-122B v2 (3.7bit) | 95/100 | S |
| Qwen3.5-122B v1 (3.7bit) | 90/100 | A+ |
| MiniMax-REAP (this model) | 71/100 | B |

The quality issues are concentrated in high-difficulty questions requiring deep understanding and rigorous reasoning, suggesting that MiniMax-M2.5's expert FFN layers are particularly sensitive to sub-4-bit quantization. The model remains capable for straightforward tasks (algorithms, code transformation, basic Q&A scored full marks).

Research Value

Despite the quality limitations, this quantization may have some research value:

  1. Feasibility demonstration: Proves that 139B MoE models can physically fit on 96GB Apple Silicon devices
  2. Sensitivity analysis: Reveals that MiniMax's architecture is more quantization-sensitive than Qwen3.5's, particularly in the expert FFN layers
  3. Architecture comparison: While Qwen3.5-122B at 3.7bit retains 95/100 reasoning capability, MiniMax-139B at the same bit-rate drops to 71/100 — suggesting that quantization resilience varies significantly across architectures
  4. Baseline for future work: Serves as a reference point for evaluating improved quantization techniques (GPTQ, AWQ, or calibration-based methods) on this architecture

📊 Benchmark (M2 Max 96GB)

While the reasoning quality is significantly impacted, the inference speed remains impressive for a 139B class model on local hardware:

| Context | Prefill (PP) | Generation (TG) |
|---|---|---|
| 1k | 160.0 tok/s | 27.9 tok/s |
| 4k | 166.5 tok/s | 22.0 tok/s |

Tested on M2 Max (38c GPU) with 96GB Unified Memory.
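
The throughput numbers above translate into practical latencies. For example, a hypothetical workload of a 4k-token prompt followed by a 500-token reply (an assumed scenario, not a measurement from this card):

```python
# Back-of-envelope latency from the measured throughput above.
prefill_tps = 166.5   # tok/s at 4k context (from the table)
gen_tps = 22.0        # tok/s at 4k context

prompt_tokens = 4096  # assumed prompt length
reply_tokens = 500    # assumed reply length

ttft = prompt_tokens / prefill_tps      # time to first token
total = ttft + reply_tokens / gen_tps   # end-to-end wall time

print(f"TTFT ≈ {ttft:.1f}s, total ≈ {total:.1f}s")
```

So a long-prompt interaction completes in under a minute on an M2 Max, which is workable for local use despite the model's size.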

🚀 Hardware Requirements

The original FP8 model weighs 131GB, and BF16 would be an impractical 260GB. Through mixed-precision quantization, we've compressed it to 60GB:

  • 96GB Unified Memory (Minimum): Fits the full model with moderate headroom for KV cache, enabling 139B-parameter local inference on a single Mac Studio or MacBook Pro.
  • 128GB+ Unified Memory (Recommended): Delivers comfortable headroom for the KV cache, supporting longer context windows.
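
The KV-cache headroom matters because of the long context window. A rough footprint estimate from the architecture table below, assuming an unquantized fp16 cache (2 bytes per value; actual MLX cache behavior may differ):

```python
# Rough fp16 KV-cache footprint: 62 layers, 8 KV heads (GQA),
# head_dim 128, K and V, 2 bytes each (fp16 assumed).
layers, kv_heads, head_dim = 62, 8, 128
bytes_per_token = layers * 2 * kv_heads * head_dim * 2  # K and V planes

for ctx in (8_192, 65_536, 196_608):
    gib = bytes_per_token * ctx / 1024**3
    print(f"{ctx:>7} tokens ≈ {gib:5.1f} GiB KV cache")
```

At the full 196,608-token context, an fp16 cache alone would approach ~46 GiB on top of the 60 GB weights, which is why 96GB is a tight minimum and 128GB+ is recommended for long contexts.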

Quantization

4-tier mixed precision by functional sensitivity:

| Bits | % Params | Modules | Description |
|---|---|---|---|
| BF16 | <0.1% | Norm layers, router gate, routing bias | Preserves routing stability |
| 8-bit | ~2.9% | Embeddings, lm_head, all q/k/v/o_proj | Model entry/exit points and full attention |
| 4-bit | ~3.1% | Layer 0 & 61 expert FFN | Entry/exit-layer experts most sensitive to quantization |
| 3-bit | ~93.9% | Expert FFN w1/w2/w3 × 154 experts × 60 layers | 8 of 154 experts active per token (~5.2%) |
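
The tier mix also explains where the 3.676 BPW figure comes from. The nominal weighted average is lower; the gap is the per-group scale/bias metadata that MLX quantization stores alongside the packed weights (the exact group size is an assumption here):

```python
# Nominal weighted bit-rate from the tier percentages above.
tiers = {16: 0.001, 8: 0.029, 4: 0.031, 3: 0.939}  # bits -> param fraction
nominal = sum(bits * frac for bits, frac in tiers.items())
overhead = 3.676 - nominal  # per-weight cost of group scales/biases

print(f"nominal ≈ {nominal:.2f} BPW, overhead ≈ {overhead:.2f} bits/weight")
```

The ~0.5 bits/weight of overhead is consistent with group-wise affine quantization metadata at typical MLX group sizes.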

Optimization Philosophy

Three targeted upgrades over baseline 3-bit quantization, costing only ~1.4GB total:

  1. embed/lm_head → 8-bit (+0.3GB): Every token passes through these layers. Quantization noise here corrupts all downstream computation (embed) or directly distorts output probabilities (lm_head).
  2. All attention → 8-bit (+0.6GB): Attention is only 2% of parameters. Going from 3→8bit costs a trivial 0.6GB but provides near-lossless attention quality, effectively preventing repetition and degradation.
  3. Entry/exit layer experts → 4-bit (+0.5GB): Layer 0's quantization errors propagate through all 61 subsequent layers. Layer 61's output feeds directly to lm_head with no opportunity for error correction.

Model Architecture

| Parameter | Value |
|---|---|
| Architecture | Transformer MoE (MiniMaxM2ForCausalLM) |
| Total Parameters | 139.1B |
| Active Parameters | ~10B per token |
| Layers | 62 |
| Hidden Size | 3,072 |
| Attention | 48 heads GQA (8 KV heads), head_dim = 128 |
| QK Norm | Per-layer RMSNorm |
| Experts | 154 per layer (top-8 activated, sigmoid router) |
| Expert FFN | 3,072 → 1,536 → 3,072 (SwiGLU) |
| Context Length | 196,608 tokens |
| Vocabulary | 200,064 tokens |
| RoPE | θ = 5,000,000, partial rotary (dim = 64) |
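
The "~10B active per token" figure follows from the table. A rough count of attention, the top-8 routed experts, and lm_head (ignoring norms, the router, and the embedding lookup, which are negligible or free per token):

```python
# Reproducing "~10B active per token" from the architecture table above.
hidden, heads, kv_heads, head_dim = 3072, 48, 8, 128
layers, topk, ffn_inner = 62, 8, 1536
vocab = 200_064

attn = hidden * heads * head_dim * 2 + hidden * kv_heads * head_dim * 2  # q/o + k/v proj
experts = 3 * hidden * ffn_inner * topk      # w1/w2/w3 for the 8 routed experts
active = layers * (attn + experts) + vocab * hidden  # + lm_head

print(f"~{active / 1e9:.1f}B active params per token")
```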

Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/MiniMax-M2.5-REAP-139B-A10B-MLX-3.7bit")

response = generate(model, tokenizer, prompt="Hello!", max_tokens=200)
print(response)
```

Original Model

This quantization is based on cerebras/MiniMax-M2.5-REAP-139B-A10B, which was created by applying REAP (Router-weighted Expert Activation Pruning) to MiniMaxAI/MiniMax-M2.5 with a 40% pruning rate, reducing experts from 256 to 154 per layer while maintaining near-lossless performance.

License

Modified MIT (inherited from base model)
