MiniMax-M2.5-oQ4

A 4-bit mixed-precision MLX quantization of MiniMaxAI/MiniMax-M2.5 using oQ, a data-driven, sensitivity-aware quantization system for Apple Silicon. Produces standard MLX safetensors compatible with omlx, mlx-lm, and any app that supports the MLX format.

Model Details

Model Description

MiniMax-M2.5 is a 229B-parameter sparse Mixture-of-Experts model with 45.9B active parameters, trained extensively with reinforcement learning across hundreds of thousands of real-world environments. It achieves state-of-the-art results on coding, agentic tool use, search, and office productivity tasks, reaching 80.2% on SWE-Bench Verified and 76.3% on BrowseComp.

This repository contains an oQ4 quantization: 4.57 bits-per-weight mixed-precision, produced via omlx's sensitivity-driven streaming quantizer. Unlike uniform 4-bit quantization, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most: critical layers (embeddings, the LM head, and the most sensitive transformer layers) are automatically promoted to 8-bit, while less sensitive layers stay at 4-bit.

  • Developed by: chevron7
  • Shared by: chevron7
  • Model type: Sparse Mixture-of-Experts (MoE) language model, MLX quantized
  • Language(s): Multilingual; trained on 10+ programming languages and broad natural language coverage
  • License: MIT
  • Base model: MiniMaxAI/MiniMax-M2.5
  • Quantization tool: omlx v0.3.0

Uses

Direct Use

Intended for local inference on Apple Silicon Macs with sufficient unified memory. Drop into any omlx model directory or use directly with mlx-lm. Well-suited for:

  • Agentic coding and software engineering tasks
  • Long-context reasoning and planning
  • Tool calling and multi-step task execution
  • Multilingual tasks across 10+ languages

Downstream Use

Compatible with any OpenAI-compatible client via omlx's server (/v1/chat/completions), Claude Code via omlx's Claude Code integration, and mlx-lm's standard generate/chat/server interfaces.
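As a concrete illustration, a chat-completions request payload for omlx's OpenAI-compatible endpoint can be built like this. The model id and base URL are assumptions; check your running server's `/v1/models` endpoint for the actual served id.

```python
# Build a request body for omlx's /v1/chat/completions endpoint.
# "MiniMax-M2.5-oQ4" is an assumed model id; verify via GET /v1/models.
import json

payload = {
    "model": "MiniMax-M2.5-oQ4",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Refactor this function to be async."},
    ],
    "temperature": 1.0,   # MiniMax's recommended sampling settings
    "top_p": 0.95,
    "stream": False,
}

body = json.dumps(payload)
print(body[:60])

# To actually send it (requires a running omlx server on localhost:8000):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```

Any OpenAI SDK pointed at `http://localhost:8000/v1` sends an equivalent body.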

Out-of-Scope Use

  • Requires Apple Silicon (M1 or later); not compatible with CUDA/ROCm
  • Not recommended for safety-critical applications without additional guardrails
  • Very long contexts (>32K tokens) may require reducing max_context_window in omlx settings depending on available memory

How to Get Started

With omlx (recommended)

Place the model directory inside your omlx model directory (default: ~/.omlx/models/). It will be auto-discovered on server start.

# Install omlx via Homebrew
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Start the server
omlx serve --model-dir ~/.omlx/models

# The model is available at http://localhost:8000/v1
# Use any OpenAI-compatible client

Or via Homebrew service (auto-restarts on crash):

brew services start omlx

With mlx-lm

pip install mlx-lm

mlx_lm.generate \
  --model chevron7/MiniMax-M2.5-oQ4 \
  --prompt "Design a distributed caching system for a global e-commerce platform."

Recommended sampling parameters

From MiniMax's official guidance:

temperature=1.0
top_p=0.95
top_k=40
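To make these parameters concrete, here is a pure-numpy sketch of how top-k and top-p (nucleus) filtering interact at these settings. It mirrors the common sampling pipeline; it is not omlx's or mlx-lm's actual implementation.

```python
# Toy top-k + top-p filtering over a logit vector. Tokens outside the top-k,
# or outside the smallest set reaching cumulative probability top_p, are
# zeroed out before sampling.
import numpy as np

def filter_logits(logits, top_k=40, top_p=0.95, temperature=1.0):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: keep only the k highest logits.
    if top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth, logits, -np.inf)
    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest prefix (by descending prob) summing to >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

p = filter_logits([2.0, 1.0, 0.5, -1.0, -3.0])
print(p)  # the low-probability tail is zeroed, the rest renormalized
```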

Quantization Details

oQ4: Sensitivity-Driven Mixed-Precision

oQ is not uniform quantization. It runs calibration inference through the model and measures per-layer quantization sensitivity using relative MSE:

sensitivity = MSE(float_output, quantized_output) / mean(float_output²)
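A minimal numpy sketch of this relative-MSE metric; the toy arrays stand in for real calibration activations, and the exact reduction omlx uses may differ.

```python
# Relative MSE between a layer's float output and its quantized counterpart,
# matching the formula above: per-element squared error, normalized by the
# mean squared magnitude of the float output.
import numpy as np

def sensitivity(float_out, quant_out):
    float_out = np.asarray(float_out, dtype=np.float64)
    quant_out = np.asarray(quant_out, dtype=np.float64)
    mse = np.mean((float_out - quant_out) ** 2)
    return mse / np.mean(float_out ** 2)

f = np.array([1.0, -2.0, 0.5, 3.0])
q = np.array([0.9, -2.1, 0.6, 2.8])   # simulated quantization error
s = sensitivity(f, q)
print(f"relative MSE sensitivity: {s:.6f}")
```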

Layers are then assigned bit boosts based on their sensitivity rank:

| Sensitivity | Boost | Result at oQ4 |
|---|---|---|
| Top (≥50% of max) | base+4 | 4 → 8-bit |
| High (≥20% of max) | base+2 | 4 → 6-bit |
| Moderate (<20%) | base+1 | 4 → 5-bit |
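The tiering can be sketched as a small function. One assumption here: only layers selected for the quantization plan receive a boost (this card reports 250 boosts); all other layers stay at the base bit width.

```python
# Map a layer's sensitivity, relative to the maximum observed sensitivity,
# to its boosted bit width per the tier table above. Applies only to layers
# chosen for the quantization plan; unselected layers keep `base` bits.
def boosted_bits(sens, max_sens, base=4):
    ratio = sens / max_sens
    if ratio >= 0.5:
        return base + 4      # top tier:      4 -> 8-bit
    if ratio >= 0.2:
        return base + 2      # high tier:     4 -> 6-bit
    return base + 1          # moderate tier: 4 -> 5-bit

print(boosted_bits(0.9, 1.0))   # 8
print(boosted_bits(0.3, 1.0))   # 6
print(boosted_bits(0.05, 1.0))  # 5
```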

Certain tensors are always protected regardless of sensitivity:

| Tensor | Treatment |
|---|---|
| lm_head | 8-bit |
| embed_tokens | 8-bit |
| MoE router weights | 8-bit |

This quantization

| Parameter | Value |
|---|---|
| oQ level | oQ4 |
| Effective bpw | 4.57 |
| Quantization plan | 250 mixed-precision boosts |
| Group size | 64 |
| Mode | Affine quantization |
| Streaming | Yes (mmap-based, never loads full model) |
| Sensitivity model | lmstudio-community/MiniMax-M2.5-MLX-4bit |
| Calibration data | code_multilingual (560 texts, 128 samples × 256 tokens) |
| Calibration categories | Code (Python, JS), English, Korean, Chinese, Japanese, tool calling, reasoning |

Key layers promoted to 8-bit by sensitivity analysis: embed_tokens, lm_head, L0, L5, L49, L50, L51, L52.
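Group-wise affine quantization with group size 64 works roughly as follows; this numpy sketch illustrates the scheme named in the table, not omlx's exact kernel.

```python
# Group-wise affine (asymmetric) quantization: each group of 64 weights gets
# its own scale and zero point, so outliers in one group do not degrade the
# resolution of the rest of the tensor.
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, levels)
    return q.astype(np.uint8), scale, w_min

def affine_dequantize(q, scale, w_min):
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(256)
q, scale, zp = affine_quantize(w, bits=4, group_size=64)
w_hat = affine_dequantize(q, scale, zp).reshape(-1)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

Rounding error per weight is at most half the group's scale, which is why smaller groups (and more bits for sensitive layers) reduce output distortion.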

Hardware used for quantization

  • Apple Mac Studio (M3 Ultra)
  • 256 GB unified memory
  • omlx v0.3.0 / quantize_oq_streaming

Evaluation

MiniMax M2.5 base model benchmarks (from MiniMaxAI/MiniMax-M2.5):

| Benchmark | Score |
|---|---|
| SWE-Bench Verified | 80.2% |
| Multi-SWE-Bench | 51.3% |
| BrowseComp (w/ context mgmt) | 76.3% |
| GPQA-Diamond | 85.2% |
| AIME 2025 | 86.3% |
| IFBench | 70.0% |
| HLE (w/o tools) | 19.4% |

Quantization quality: oQ4 at 4.57 bpw has been benchmarked by the omlx project against standard mlx-lm 4-bit quantization on Qwen3.5-35B-A3B:

| Benchmark | mlx-lm 4-bit | oQ4 |
|---|---|---|
| MMLU (300 samples) | 79.7% | 83.3% |
| TruthfulQA (300 samples) | 87.7% | 88.0% |
| HumanEval (full) | 87.2% | 85.4% |
| MBPP (300 samples) | 71.7% | 74.3% |

Results for MiniMax-M2.5-oQ4 specifically have not been independently benchmarked.

Model Architecture

MiniMax M2.5 is a sparse Mixture-of-Experts model:

  • Total parameters: 229B
  • Active parameters per token: ~45.9B
  • Architecture: MoE transformer with hybrid attention
  • Context length: Up to 1M tokens (native training)
  • Training: Extensive RL across 100,000+ real-world environments using the Forge agentic RL framework
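A back-of-envelope estimate of the weight memory this quantization needs, from the figures above (229B parameters at an effective 4.57 bits-per-weight). This ignores KV cache, activation memory, and metadata overhead, so treat it as a lower bound on required unified memory.

```python
# Weight footprint = total parameters x effective bits-per-weight / 8.
total_params = 229e9     # 229B total parameters
effective_bpw = 4.57     # reported effective bits-per-weight

weight_bytes = total_params * effective_bpw / 8
print(f"~{weight_bytes / 1e9:.0f} GB of weights")  # roughly 131 GB
```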

Citation

@misc{minimax-m2.5,
  title        = {MiniMax-M2.5},
  author       = {{MiniMax Team}},
  year         = {2026},
  url          = {https://huggingface.co/MiniMaxAI/MiniMax-M2.5}
}

@misc{omlx,
  title        = {oMLX: LLM inference optimized for Apple Silicon},
  author       = {jundot},
  year         = {2026},
  url          = {https://github.com/jundot/omlx}
}

Model Card Authors

chevron7
