MiniMax-M2.7 JANG_2L

MiniMax M2.7 228B MoE — 2.10-bit mixed precision, 63 GB

Smallest MiniMax M2.7 for Apple Silicon — fits on 96 GB+ Macs.

Recommended: run in MLX Studio for the best experience, including thinking-mode support and optimized MoE inference.

Important Settings

MiniMax M2.7 is an always-reasoning model. It thinks before answering on every prompt.

| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED — greedy/temp=0 causes infinite thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
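The repetition penalty rescales logits of tokens that have already been generated. A minimal pure-Python sketch of the standard (CTRL-style) rule — illustrative only, not the MLX implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Shrink positive logits and push negative logits further down
    for every token id that already appears in the output."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Tokens 0 (logit 2.0) and 1 (logit -1.0) were already generated:
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1])
```

A penalty of 1.1 is gentle: it nudges probability away from recent tokens without distorting the distribution the way larger values do.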

Model Details

| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), GQA (48 heads / 8 KV), partial RoPE |
| Total Parameters | 228.7B |
| Active Parameters | ~1.4B per token |
| Profile | JANG_2L (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=2-bit) |
| Actual avg bits | 2.10 |
| Model size | 63 GB |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| group_size | 128 (speed-optimized for 256 experts) |
| Routing | Sigmoid + bias correction (not softmax) |
| QK-norm | Full vector RMSNorm |
| Context | 192K tokens |
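group_size 128 means each run of 128 weights shares one scale and offset, stored alongside the packed integers. A simplified sketch of affine group quantization at 2 bits (illustrative only; MLX's internal packing and layout differ):

```python
def quantize_group(weights, bits=2):
    """Affine group quantization: one shared scale/offset per group,
    each weight stored as an integer in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # guard against constant groups
    q = [round((w - lo) / scale) for w in weights]
    dequant = [lo + qi * scale for qi in q]
    return q, dequant

group = [-1.0, -0.5, 0.0, 0.5, 1.0]  # stand-in for a 128-weight group
q, deq = quantize_group(group, bits=2)
```

At 2 bits there are only four representable levels per group, so the reconstruction error per weight is bounded by half the group's scale — which is why only the error-tolerant expert MLPs get this treatment.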

JANG_2L Bit Allocation

| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), lm_head | 8 |
| IMPORTANT | Embeddings | 6 |
| COMPRESS | Expert MLP (w1/w2/w3) — 98%+ of params | 2 |
| Passthrough | MoE router/gate (float16), norms, QK-norms | 16 |
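The 2.10 average falls out of the parameter split: nearly all weights sit in the 2-bit expert tier. A back-of-envelope check with assumed, purely illustrative tier fractions (not numbers from this card):

```python
# Assumed parameter fractions per tier -- illustrative, not measured:
tiers = {
    "COMPRESS: expert MLPs @ 2-bit":       (0.982, 2),
    "CRITICAL: attention + lm_head @ 8":   (0.012, 8),
    "IMPORTANT: embeddings @ 6":           (0.004, 6),
    "Passthrough: router/norms @ fp16":    (0.002, 16),
}
avg_bits = sum(frac * bits for frac, bits in tiers.values())
print(f"average bits per weight: {avg_bits:.2f}")  # -> 2.12, near the card's 2.10
```

Because the expert MLPs dominate the parameter count, even full 8-bit attention barely moves the average off 2 bits.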

JANG keeps the router at full float16 precision and attention at high-precision 8-bit while compressing the 256 expert MLPs to 2-bit, the part of an MoE model most tolerant of quantization. The router is never quantized, preserving maximum routing precision.
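The sigmoid-plus-bias routing noted above can be sketched as follows. This is an illustrative reimplementation of the general scheme used by comparable MoE designs (the bias steers which experts are selected, while the gate weights come from the raw sigmoid affinities), not the model's exact code:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route(logits, bias, k=2):
    # Per-expert sigmoid affinity: independent scores, not a softmax.
    affinity = [sigmoid(x) for x in logits]
    # Bias-corrected scores decide WHICH experts fire...
    scored = [a + b for a, b in zip(affinity, bias)]
    topk = sorted(range(len(logits)), key=lambda i: scored[i], reverse=True)[:k]
    # ...but gate weights use the raw affinities, renormalized over the top-k.
    total = sum(affinity[i] for i in topk)
    return topk, {i: affinity[i] / total for i in topk}

experts, weights = route([2.0, 0.0, -1.0, 1.0], bias=[0.0, 0.0, 0.0, 0.0])
# A negative bias on expert 0 steers selection away from it:
steered, _ = route([2.0, 0.0, -1.0, 1.0], bias=[-1.0, 0.0, 0.0, 0.0])
```

Because selection depends on the sum of a bounded sigmoid and a learned bias, even small quantization error in the router can flip which experts fire — the motivation for keeping it at float16.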

MMLU Benchmarks (200q, 10 subjects, reasoning ON)

Coming soon — benchmarks in progress.

Why JANG for MiniMax

Standard MLX quantization of MiniMax produces broken output at all bit levels (~25% MMLU, i.e., random guessing). JANG's mixed-precision approach is currently the only working quantized MiniMax on Apple Silicon.

On M2.5, JANG_2L achieved 74% MMLU vs MLX's 25% (random). M2.7 results pending.

All Quantizations

| Model | Profile | Size | Avg Bits |
|---|---|---|---|
| JANG_2L | (8, 6, 2) | 63 GB | 2.10 |
| JANG_3L | (8, 4, 3) | 89 GB | 3.08 |
| JANG_4M | (8, 4, 4) | 115 GB | 4.06 |
| JANG_6M | (8, 6, 6) | 167 GB | 6.03 |
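As a sanity check, each listed size is roughly total parameters times average bits over 8 bytes, plus a few percent of quantization metadata (per-group scales and offsets):

```python
total_params = 228.7e9  # from the model details table
profiles = {  # profile: (avg bits, listed size in GB)
    "JANG_2L": (2.10, 63),
    "JANG_3L": (3.08, 89),
    "JANG_4M": (4.06, 115),
    "JANG_6M": (6.03, 167),
}
for name, (avg_bits, listed_gb) in profiles.items():
    est_gb = total_params * avg_bits / 8 / 1e9
    print(f"{name}: ~{est_gb:.0f} GB estimated vs {listed_gb} GB listed")
```

All four estimates land within about 5% of the listed sizes, which is a useful rule of thumb for judging whether a given profile fits your Mac's unified memory.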

Requirements

  • Apple Silicon Mac with 96 GB+ unified memory
  • MLX framework
  • MLX Studio recommended

Tool Use / Agent Mode

MiniMax M2.7 uses interleaved thinking + tool calls — it reasons inside <think> blocks, then emits tool calls in <minimax:tool_call> format. Some clients (Opencode, etc.) may strip the <think> block and miss the tool call.

For tool-use clients, set enable_thinking=False in the chat template:

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False  # skips <think> injection for tool-use
)
```

MiniMax tool call format:

```xml
<minimax:tool_call>
<invoke name="tool_name">
<parameter name="param1">value1</parameter>
</invoke>
</minimax:tool_call>
```
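If your client does not parse this format natively, a minimal regex extractor looks like the sketch below. It is an illustration, not an official parser; the `get_weather`/`city` names are made up for the example:

```python
import re

TOOL_CALL_RE = re.compile(
    r'<minimax:tool_call>\s*<invoke name="(?P<name>[^"]+)">(?P<body>.*?)'
    r'</invoke>\s*</minimax:tool_call>',
    re.DOTALL,
)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_tool_calls(text):
    """Return a list of (tool_name, {param: value}) pairs from model output."""
    return [
        (m.group("name"), dict(PARAM_RE.findall(m.group("body"))))
        for m in TOOL_CALL_RE.finditer(text)
    ]

sample = (
    '<minimax:tool_call>\n<invoke name="get_weather">\n'
    '<parameter name="city">Tokyo</parameter>\n'
    '</invoke>\n</minimax:tool_call>'
)
calls = parse_tool_calls(sample)  # [("get_weather", {"city": "Tokyo"})]
```

Run any such extractor on the raw model output, before any client-side stripping of `<think>` blocks, so interleaved tool calls are not lost.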

Usage

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.7-JANG_2L")
sampler = make_sampler(temp=1.0, top_p=0.95)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False, add_generation_prompt=True
)
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(output)
```

Support

MLX Studio | JANGQ | X @dealignai

Quantized by Jinho Jang (eric@jangq.ai) using JANG Tools v2.4.1.


This model is provided for research and personal use. Users are responsible for ensuring their use complies with applicable laws and the MiniMax license.
