MiniMax-M2.5-oQ4
A 4-bit mixed-precision MLX quantization of MiniMaxAI/MiniMax-M2.5 using oQ, a data-driven, sensitivity-aware quantization system for Apple Silicon. Produces standard MLX safetensors compatible with omlx, mlx-lm, and any app that supports MLX format.
Model Details
Model Description
MiniMax-M2.5 is a 229B-parameter sparse Mixture-of-Experts model with 45.9B active parameters, trained extensively with reinforcement learning across hundreds of thousands of real-world environments. It achieves state-of-the-art results on coding, agentic tool use, search, and office productivity tasks, reaching 80.2% on SWE-Bench Verified and 76.3% on BrowseComp.
This repository contains an oQ4 quantization: 4.57 bits-per-weight mixed-precision, produced via omlx's sensitivity-driven streaming quantizer. Unlike uniform 4-bit quantization, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most: critical layers (embeddings, LM head, the most sensitive transformer layers) are automatically promoted to 8-bit, while less sensitive layers stay at 4-bit.
- Developed by: chevron7
- Shared by: chevron7
- Model type: Sparse Mixture-of-Experts (MoE) language model, MLX quantized
- Language(s): Multilingual; trained on 10+ programming languages and broad natural language coverage
- License: MIT
- Base model: MiniMaxAI/MiniMax-M2.5
- Quantization tool: omlx v0.3.0
Model Sources
- Repository: https://huggingface.co/chevron7/MiniMax-M2.5-oQ4
- Base model: https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- omlx (quantization tool): https://github.com/jundot/omlx
- oQ quantization spec: https://github.com/jundot/omlx/blob/HEAD/docs/oQ_Quantization.md
Uses
Direct Use
Intended for local inference on Apple Silicon Macs with sufficient unified memory. Drop into any omlx model directory or use directly with mlx-lm. Well-suited for:
- Agentic coding and software engineering tasks
- Long-context reasoning and planning
- Tool calling and multi-step task execution
- Multilingual tasks across 10+ languages
Downstream Use
Compatible with any OpenAI-compatible client via omlx's server (/v1/chat/completions), Claude Code via omlx's Claude Code integration, and mlx-lm's standard generate/chat/server interfaces.
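As a sketch, a chat completion request against a local omlx server might be assembled like this using only the standard library. The base URL, port, and model name are assumptions; adjust them to your setup:

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages):
    """Build an OpenAI-style /v1/chat/completions request (not yet sent)."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000",               # assumed omlx default port
    "MiniMax-M2.5-oQ4",                    # assumed model id in your model dir
    [{"role": "user", "content": "Write a binary search in Python."}],
)
# Sending it requires a running omlx server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI SDK pointed at the same base URL works the same way.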
Out-of-Scope Use
- Requires Apple Silicon (M1 or later); not compatible with CUDA/ROCm
- Not recommended for safety-critical applications without additional guardrails
- Very long contexts (>32K tokens) may require reducing `max_context_window` in omlx settings depending on available memory
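As a rough back-of-envelope for the memory requirement, the weight footprint at 4.57 effective bits per weight can be estimated directly from the parameter count. This covers weights only; KV cache and runtime overhead come on top:

```python
total_params = 229e9      # MoE: all experts are stored, even though ~45.9B are active
effective_bpw = 4.57      # effective bits per weight of this oQ4 quantization
weight_bytes = total_params * effective_bpw / 8
print(f"~{weight_bytes / 1024**3:.0f} GiB for weights alone")
```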
How to Get Started
With omlx (recommended)
Place the model directory inside your omlx model directory (default: ~/.omlx/models/). It will be auto-discovered on server start.
```shell
# Install omlx via Homebrew
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Start the server
omlx serve --model-dir ~/.omlx/models

# The model is available at http://localhost:8000/v1
# Use any OpenAI-compatible client
```
Or via Homebrew service (auto-restarts on crash):
```shell
brew services start omlx
```
With mlx-lm
```shell
pip install mlx-lm

mlx_lm.generate \
  --model chevron7/MiniMax-M2.5-oQ4 \
  --prompt "Design a distributed caching system for a global e-commerce platform."
```
Recommended sampling parameters
From MiniMax's official guidance:
```
temperature=1.0
top_p=0.95
top_k=40
```
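Expressed as an OpenAI-style request body the same parameters look like this. Note that `top_k` is not part of the standard OpenAI API, though many local servers (including mlx-lm's) accept it:

```json
{
  "model": "MiniMax-M2.5-oQ4",
  "messages": [{"role": "user", "content": "..."}],
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40
}
```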
Quantization Details
oQ4 โ Sensitivity-Driven Mixed-Precision
oQ is not uniform quantization. It runs calibration inference through the model and measures per-layer quantization sensitivity using relative MSE:
```
sensitivity = MSE(float_output, quantized_output) / mean(float_output²)
```
Layers are then assigned bit boosts based on their sensitivity rank:
| Sensitivity | Boost | Result at oQ4 |
|---|---|---|
| Top (≥50% of max) | base+4 | 4 → 8-bit |
| High (≥20% of max) | base+2 | 4 → 6-bit |
| Moderate (<20% of max) | base+1 | 4 → 5-bit |
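The sensitivity metric and tiering rule can be sketched as follows. This is an illustrative reading, not omlx's actual code; in particular, the `n_boosted` cutoff is a hypothetical stand-in for however the quantization plan limits itself to the most sensitive layers (250 boosts in this model), so that the remaining layers stay at the 4-bit base:

```python
def relative_mse(float_out, quant_out):
    """sensitivity = MSE(float, quantized) / mean(float²), per the formula above."""
    n = len(float_out)
    mse = sum((f - q) ** 2 for f, q in zip(float_out, quant_out)) / n
    mean_sq = sum(f ** 2 for f in float_out) / n
    return mse / mean_sq

def assign_bits(sensitivities, base_bits=4, n_boosted=250):
    """Boost only the n_boosted most sensitive layers; tiers follow the table above.
    All other layers keep the base width."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    boosted = set(ranked[:n_boosted])
    s_max = max(sensitivities.values())
    plan = {}
    for layer, s in sensitivities.items():
        if layer not in boosted:
            plan[layer] = base_bits            # untouched: stays 4-bit
        elif s >= 0.5 * s_max:
            plan[layer] = base_bits + 4        # top tier: 4 -> 8-bit
        elif s >= 0.2 * s_max:
            plan[layer] = base_bits + 2        # high tier: 4 -> 6-bit
        else:
            plan[layer] = base_bits + 1        # moderate tier: 4 -> 5-bit
    return plan
```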
Certain tensors are always protected regardless of sensitivity:
| Tensor | Treatment |
|---|---|
| `lm_head` | 8-bit |
| `embed_tokens` | 8-bit |
| MoE router weights | 8-bit |
This quantization
| Parameter | Value |
|---|---|
| oQ level | oQ4 |
| Effective bpw | 4.57 |
| Quantization plan | 250 mixed-precision boosts |
| Group size | 64 |
| Mode | Affine quantization |
| Streaming | Yes (mmap-based, never loads full model) |
| Sensitivity model | lmstudio-community/MiniMax-M2.5-MLX-4bit |
| Calibration data | code_multilingual (560 texts, 128 samples × 256 tokens) |
| Calibration categories | Code (Python, JS), English, Korean, Chinese, Japanese, tool calling, reasoning |
Key layers promoted to 8-bit by sensitivity analysis: embed_tokens, lm_head, L0, L5, L49, L50, L51, L52.
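The affine mode with group size 64 stores a per-group scale and zero point alongside the packed integer weights. A minimal NumPy sketch of the idea (not omlx's actual kernel, which packs bits and runs on-device):

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Affine (asymmetric) quantization with a per-group scale and minimum."""
    g = w.reshape(-1, group_size)
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def affine_dequantize(q, scale, w_min):
    return q * scale + w_min

w = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
q, scale, zero = affine_quantize(w)
w_hat = affine_dequantize(q, scale, zero).reshape(w.shape)
# Relative MSE, the same metric oQ uses for sensitivity ranking:
err = np.mean((w - w_hat) ** 2) / np.mean(w ** 2)
```

At 4 bits each group of 64 weights shares one scale/minimum pair, which is where the fractional overhead above 4.0 bpw partly comes from.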
Hardware used for quantization
- Apple Mac Studio (M3 Ultra)
- 256 GB unified memory
- omlx v0.3.0 / quantize_oq_streaming
Evaluation
MiniMax M2.5 base model benchmarks (from MiniMaxAI/MiniMax-M2.5):
| Benchmark | Score |
|---|---|
| SWE-Bench Verified | 80.2% |
| Multi-SWE-Bench | 51.3% |
| BrowseComp (w/ context mgmt) | 76.3% |
| GPQA-Diamond | 85.2% |
| AIME 2025 | 86.3% |
| IFBench | 70.0% |
| HLE (w/o tools) | 19.4% |
Quantization quality: oQ4 at 4.57 bpw has been benchmarked by the omlx project against standard mlx-lm 4-bit quantization on Qwen3.5-35B-A3B:
| Benchmark | mlx-lm 4-bit | oQ4 |
|---|---|---|
| MMLU (300 samples) | 79.7% | 83.3% |
| TruthfulQA (300 samples) | 87.7% | 88.0% |
| HumanEval (full) | 87.2% | 85.4% |
| MBPP (300 samples) | 71.7% | 74.3% |
Results for MiniMax-M2.5-oQ4 specifically have not been independently benchmarked.
Model Architecture
MiniMax M2.5 is a sparse Mixture-of-Experts model:
- Total parameters: 229B
- Active parameters per token: ~45.9B
- Architecture: MoE transformer with hybrid attention
- Context length: Up to 1M tokens (native training)
- Training: Extensive RL across 100,000+ real-world environments using the Forge agentic RL framework
Citation
```bibtex
@misc{minimax-m2.5,
  title  = {MiniMax-M2.5},
  author = {{MiniMax Team}},
  year   = {2026},
  url    = {https://huggingface.co/MiniMaxAI/MiniMax-M2.5}
}

@misc{omlx,
  title  = {oMLX: LLM inference optimized for Apple Silicon},
  author = {jundot},
  year   = {2026},
  url    = {https://github.com/jundot/omlx}
}
```
Model Card Authors
chevron7