MiniMax-M2.5-oQ4

A 4-bit mixed-precision MLX quantization of MiniMaxAI/MiniMax-M2.5 using oQ, a data-driven, sensitivity-aware quantization system for Apple Silicon. Produces standard MLX safetensors compatible with omlx, mlx-lm, and any app that supports the MLX format.

Model Details

Model Description

MiniMax-M2.5 is a 229B-parameter sparse Mixture-of-Experts model with 45.9B active parameters, trained extensively with reinforcement learning across hundreds of thousands of real-world environments. It achieves state-of-the-art results on coding, agentic tool use, search, and office productivity tasks, reaching 80.2% on SWE-Bench Verified and 76.3% on BrowseComp.

This repository contains an oQ4 quantization: 4.57 bits-per-weight mixed-precision, produced via omlx's sensitivity-driven streaming quantizer. Unlike uniform 4-bit quantization, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most: critical layers (embeddings, the LM head, and the most sensitive transformer layers) are automatically promoted to 8-bit, while less sensitive layers stay at 4-bit.

  • Developed by: chevron7
  • Shared by: chevron7
  • Model type: Sparse Mixture-of-Experts (MoE) language model, MLX quantized
  • Language(s): Multilingual; trained on 10+ programming languages and broad natural language coverage
  • License: MIT
  • Base model: MiniMaxAI/MiniMax-M2.5
  • Quantization tool: omlx v0.3.0

Uses

Direct Use

Intended for local inference on Apple Silicon Macs with sufficient unified memory. Drop into any omlx model directory or use directly with mlx-lm. Well-suited for:

  • Agentic coding and software engineering tasks
  • Long-context reasoning and planning
  • Tool calling and multi-step task execution
  • Multilingual tasks across 10+ languages

Downstream Use

Compatible with any OpenAI-compatible client via omlx's server (/v1/chat/completions), Claude Code via omlx's Claude Code integration, and mlx-lm's standard generate/chat/server interfaces.
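As a concrete illustration, a chat-completions request payload for omlx's OpenAI-compatible endpoint can be built like this. The model id and base URL are assumptions; check your running server's `/v1/models` endpoint for the actual served id.

```python
# Build a request body for omlx's /v1/chat/completions endpoint.
# "MiniMax-M2.5-oQ4" is an assumed model id; verify via GET /v1/models.
import json

payload = {
    "model": "MiniMax-M2.5-oQ4",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Refactor this function to be async."},
    ],
    "temperature": 1.0,   # MiniMax's recommended sampling settings
    "top_p": 0.95,
    "stream": False,
}

body = json.dumps(payload)
print(body[:60])

# To actually send it (requires a running omlx server on localhost:8000):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```

Any OpenAI SDK pointed at `http://localhost:8000/v1` sends an equivalent body.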

Out-of-Scope Use

  • Requires Apple Silicon (M1 or later); not compatible with CUDA/ROCm
  • Not recommended for safety-critical applications without additional guardrails
  • Very long contexts (>32K tokens) may require reducing max_context_window in omlx settings depending on available memory

How to Get Started

With omlx (recommended)

Place the model directory inside your omlx model directory (default: ~/.omlx/models/). It will be auto-discovered on server start.

# Install omlx via Homebrew
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Start the server
omlx serve --model-dir ~/.omlx/models

# The model is available at http://localhost:8000/v1
# Use any OpenAI-compatible client

Or via Homebrew service (auto-restarts on crash):

brew services start omlx

With mlx-lm

pip install mlx-lm

mlx_lm.generate \
  --model chevron7/MiniMax-M2.5-oQ4 \
  --prompt "Design a distributed caching system for a global e-commerce platform."

Recommended sampling parameters

From MiniMax's official guidance:

temperature=1.0
top_p=0.95
top_k=40
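To make these parameters concrete, here is a pure-numpy sketch of how top-k and top-p (nucleus) filtering interact at these settings. It mirrors the common sampling pipeline; it is not omlx's or mlx-lm's actual implementation.

```python
# Toy top-k + top-p filtering over a logit vector. Tokens outside the top-k,
# or outside the smallest set reaching cumulative probability top_p, are
# zeroed out before sampling.
import numpy as np

def filter_logits(logits, top_k=40, top_p=0.95, temperature=1.0):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: keep only the k highest logits.
    if top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth, logits, -np.inf)
    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest prefix (by descending prob) summing to >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

p = filter_logits([2.0, 1.0, 0.5, -1.0, -3.0])
print(p)  # the low-probability tail is zeroed, the rest renormalized
```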

Quantization Details

oQ4: Sensitivity-Driven Mixed-Precision

oQ is not uniform quantization. It runs calibration inference through the model and measures per-layer quantization sensitivity using relative MSE:

sensitivity = MSE(float_output, quantized_output) / mean(float_output²)
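A minimal numpy sketch of this relative-MSE metric; the toy arrays stand in for real calibration activations, and the exact reduction omlx uses may differ.

```python
# Relative MSE between a layer's float output and its quantized counterpart,
# matching the formula above: per-element squared error, normalized by the
# mean squared magnitude of the float output.
import numpy as np

def sensitivity(float_out, quant_out):
    float_out = np.asarray(float_out, dtype=np.float64)
    quant_out = np.asarray(quant_out, dtype=np.float64)
    mse = np.mean((float_out - quant_out) ** 2)
    return mse / np.mean(float_out ** 2)

f = np.array([1.0, -2.0, 0.5, 3.0])
q = np.array([0.9, -2.1, 0.6, 2.8])   # simulated quantization error
s = sensitivity(f, q)
print(f"relative MSE sensitivity: {s:.6f}")
```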

Layers are then assigned bit boosts based on their sensitivity rank:

| Sensitivity | Boost | Result at oQ4 |
|---|---|---|
| Top (≥50% of max) | base+4 | 4 → 8-bit |
| High (≥20% of max) | base+2 | 4 → 6-bit |
| Moderate (<20%) | base+1 | 4 → 5-bit |
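The tiering can be sketched as a small function. One assumption here: only layers selected for the quantization plan receive a boost (this card reports 250 boosts); all other layers stay at the base bit width.

```python
# Map a layer's sensitivity, relative to the maximum observed sensitivity,
# to its boosted bit width per the tier table above. Applies only to layers
# chosen for the quantization plan; unselected layers keep `base` bits.
def boosted_bits(sens, max_sens, base=4):
    ratio = sens / max_sens
    if ratio >= 0.5:
        return base + 4      # top tier:      4 -> 8-bit
    if ratio >= 0.2:
        return base + 2      # high tier:     4 -> 6-bit
    return base + 1          # moderate tier: 4 -> 5-bit

print(boosted_bits(0.9, 1.0))   # 8
print(boosted_bits(0.3, 1.0))   # 6
print(boosted_bits(0.05, 1.0))  # 5
```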

Certain tensors are always protected regardless of sensitivity:

| Tensor | Treatment |
|---|---|
| lm_head | 8-bit |
| embed_tokens | 8-bit |
| MoE router weights | 8-bit |

This quantization

| Parameter | Value |
|---|---|
| oQ level | oQ4 |
| Effective bpw | 4.57 |
| Quantization plan | 250 mixed-precision boosts |
| Group size | 64 |
| Mode | Affine quantization |
| Streaming | Yes (mmap-based, never loads full model) |
| Sensitivity model | lmstudio-community/MiniMax-M2.5-MLX-4bit |
| Calibration data | code_multilingual (560 texts, 128 samples × 256 tokens) |
| Calibration categories | Code (Python, JS), English, Korean, Chinese, Japanese, tool calling, reasoning |

Key layers promoted to 8-bit by sensitivity analysis: embed_tokens, lm_head, L0, L5, L49, L50, L51, L52.
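Group-wise affine quantization with group size 64 works roughly as follows; this numpy sketch illustrates the scheme named in the table, not omlx's exact kernel.

```python
# Group-wise affine (asymmetric) quantization: each group of 64 weights gets
# its own scale and zero point, so outliers in one group do not degrade the
# resolution of the rest of the tensor.
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, levels)
    return q.astype(np.uint8), scale, w_min

def affine_dequantize(q, scale, w_min):
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(256)
q, scale, zp = affine_quantize(w, bits=4, group_size=64)
w_hat = affine_dequantize(q, scale, zp).reshape(-1)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

Rounding error per weight is at most half the group's scale, which is why smaller groups (and more bits for sensitive layers) reduce output distortion.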

Hardware used for quantization

  • Apple Mac Studio (M3 Ultra)
  • 256 GB unified memory
  • omlx v0.3.0 / quantize_oq_streaming

Evaluation

MiniMax M2.5 base model benchmarks (from MiniMaxAI/MiniMax-M2.5):

| Benchmark | Score |
|---|---|
| SWE-Bench Verified | 80.2% |
| Multi-SWE-Bench | 51.3% |
| BrowseComp (w/ context mgmt) | 76.3% |
| GPQA-Diamond | 85.2% |
| AIME 2025 | 86.3% |
| IFBench | 70.0% |
| HLE (w/o tools) | 19.4% |

Quantization quality: oQ4 at 4.57 bpw has been benchmarked by the omlx project against standard mlx-lm 4-bit quantization on Qwen3.5-35B-A3B:

| Benchmark | mlx-lm 4-bit | oQ4 |
|---|---|---|
| MMLU (300 samples) | 79.7% | 83.3% |
| TruthfulQA (300 samples) | 87.7% | 88.0% |
| HumanEval (full) | 87.2% | 85.4% |
| MBPP (300 samples) | 71.7% | 74.3% |

Results for MiniMax-M2.5-oQ4 specifically have not been independently benchmarked.

Model Architecture

MiniMax M2.5 is a sparse Mixture-of-Experts model:

  • Total parameters: 229B
  • Active parameters per token: ~45.9B
  • Architecture: MoE transformer with hybrid attention
  • Context length: Up to 1M tokens (native training)
  • Training: Extensive RL across 100,000+ real-world environments using the Forge agentic RL framework
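A back-of-envelope estimate of the weight memory this quantization needs, from the figures above (229B parameters at an effective 4.57 bits-per-weight). This ignores KV cache, activation memory, and metadata overhead, so treat it as a lower bound on required unified memory.

```python
# Weight footprint = total parameters x effective bits-per-weight / 8.
total_params = 229e9     # 229B total parameters
effective_bpw = 4.57     # reported effective bits-per-weight

weight_bytes = total_params * effective_bpw / 8
print(f"~{weight_bytes / 1e9:.0f} GB of weights")  # roughly 131 GB
```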

Citation

@misc{minimax-m2.5,
  title        = {MiniMax-M2.5},
  author       = {{MiniMax Team}},
  year         = {2026},
  url          = {https://huggingface.co/MiniMaxAI/MiniMax-M2.5}
}

@misc{omlx,
  title        = {oMLX: LLM inference optimized for Apple Silicon},
  author       = {jundot},
  year         = {2026},
  url          = {https://github.com/jundot/omlx}
}

Model Card Authors

chevron7
