# MiniMax-M2.7 JANG_2L

MiniMax M2.7 228B MoE — 2.10-bit mixed precision, 63 GB

Smallest MiniMax M2.7 quantization for Apple Silicon — fits on Macs with 96 GB+ unified memory.

**Recommended:** Run in MLX Studio for the best experience, including thinking-mode support and optimized MoE inference.
## Important Settings

MiniMax M2.7 is an always-reasoning model: it thinks before answering every prompt.
| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED — greedy/temp=0 causes infinite thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
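For intuition, here is a minimal NumPy sketch of what temperature, top-k, and top-p do to a next-token distribution. This is illustrative only — MLX Studio and `mlx_lm`'s `make_sampler` apply the real filtering internally, and the example logits are made up:

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=40, top_p=0.95):
    """Apply temperature scaling, then top-k, then top-p (nucleus) filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: keep only the k highest logits.
    if top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Toy 4-token vocabulary: the weakest token is dropped by top_k=3.
probs = filter_logits([2.0, 1.0, 0.5, -1.0], temperature=1.0, top_k=3, top_p=0.95)
```

With temperature 0 (greedy), the distribution collapses onto a single token — which is exactly the setting that sends this model into infinite thinking loops.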
## Model Details
| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), GQA (48 heads / 8 KV), partial RoPE |
| Total Parameters | 228.7B |
| Active Parameters | ~1.4B per token |
| Profile | JANG_2L (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=2-bit) |
| Actual avg bits | 2.10 |
| Model size | 63 GB |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| group_size | 128 (speed-optimized for 256 experts) |
| Routing | Sigmoid + bias correction (not softmax) |
| QK-norm | Full vector RMSNorm |
| Context | 192K tokens |
## JANG_2L Bit Allocation
| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), lm_head | 8 |
| IMPORTANT | Embeddings | 6 |
| COMPRESS | Expert MLP (w1/w2/w3) — 98%+ of params | 2 |
| Passthrough | MoE router/gate (float16), norms, QK-norms | 16 |
JANG keeps routing at full precision and attention at high precision (8-bit) while compressing the 256 expert MLPs — the part of an MoE model most tolerant of quantization. The router stays at float16 (unquantized) for maximum routing accuracy.
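A back-of-the-envelope check shows how these tiers produce a ~2.10-bit average. The parameter fractions below are illustrative placeholders, not the exact M2.7 breakdown:

```python
# Weighted average bits across tiers (fractions are illustrative:
# expert MLPs dominate at ~98% of parameters, per the table above).
tiers = {
    "expert_mlp (2-bit)":  (0.982, 2),
    "attention (8-bit)":   (0.012, 8),
    "embeddings (6-bit)":  (0.005, 6),
    "router/norms (fp16)": (0.001, 16),
}
avg_bits = sum(frac * bits for frac, bits in tiers.values())
# Because the 2-bit tier covers ~98% of weights, avg_bits lands near 2.1.
```

The takeaway: protecting attention and routing is nearly free, because those tensors are a tiny fraction of a 228B MoE's weights.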
## MMLU Benchmarks (200 questions, 10 subjects, reasoning ON)

Coming soon — benchmarks in progress.
## Why JANG for MiniMax

Standard MLX quantization of MiniMax produces completely broken output at every bit level (~25% MMLU, i.e. random guessing). JANG's mixed-precision approach is the only working quantized MiniMax on Apple Silicon.

On M2.5, JANG_2L achieved 74% MMLU versus MLX's 25% (random). M2.7 results are pending.
## All Quantizations
| Model | Profile | Size | Avg Bits |
|---|---|---|---|
| JANG_2L | (8, 6, 2) | 63 GB | 2.10 |
| JANG_3L | (8, 4, 3) | 89 GB | 3.08 |
| JANG_4M | (8, 4, 4) | 115 GB | 4.06 |
| JANG_6M | (8, 6, 6) | 167 GB | 6.03 |
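The listed sizes can be sanity-checked with a first-order estimate, total parameters × average bits / 8. This ignores quantization scales and fp16 passthrough tensors, so it lands within a few GB of the table:

```python
# Rough size estimate per profile: params * avg_bits / 8 bytes.
TOTAL_PARAMS = 228.7e9  # from the Model Details table

profiles = {"JANG_2L": 2.10, "JANG_3L": 3.08, "JANG_4M": 4.06, "JANG_6M": 6.03}
approx_gb = {name: TOTAL_PARAMS * bits / 8 / 1e9 for name, bits in profiles.items()}
# e.g. JANG_2L: ~60 GB estimated vs. 63 GB listed.
```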
## Requirements

- Apple Silicon Mac with 96 GB+ unified memory
- MLX framework
- MLX Studio (recommended)
## Tool Use / Agent Mode

MiniMax M2.7 uses interleaved thinking and tool calls: it reasons inside `<think>` blocks, then emits tool calls in `<minimax:tool_call>` format. Some clients (Opencode, etc.) may strip the `<think>` block and miss the tool call.
For tool-use clients, set `enable_thinking=False` in the chat template:

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,  # skips <think> injection for tool use
)
```
MiniMax tool call format:

```xml
<minimax:tool_call>
<invoke name="tool_name">
<parameter name="param1">value1</parameter>
</invoke>
</minimax:tool_call>
```
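If your client strips or mishandles these blocks, the format above is simple enough to extract yourself. A minimal regex-based sketch (not a full XML parser — it assumes well-formed output in the exact format shown):

```python
import re

def parse_tool_calls(text):
    """Extract MiniMax-format tool calls from raw model output."""
    calls = []
    for block in re.findall(r"<minimax:tool_call>(.*?)</minimax:tool_call>", text, re.S):
        for name, body in re.findall(r'<invoke name="([^"]+)">(.*?)</invoke>', block, re.S):
            # Collect <parameter name="...">value</parameter> pairs into a dict.
            params = dict(re.findall(r'<parameter name="([^"]+)">(.*?)</parameter>', body, re.S))
            calls.append({"name": name, "parameters": params})
    return calls

sample = (
    "<think>I should check the weather.</think>\n"
    '<minimax:tool_call>\n<invoke name="get_weather">\n'
    "<parameter name=\"city\">Tokyo</parameter>\n</invoke>\n</minimax:tool_call>"
)
calls = parse_tool_calls(sample)
```

Note the parser scans the whole output, so it finds the tool call even when a `<think>` block precedes it.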
## Usage

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.7-JANG_2L")
sampler = make_sampler(temp=1.0, top_p=0.95)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False, add_generation_prompt=True,
)
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(output)
```
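Because the model always reasons first, raw output contains a `<think>` block ahead of the answer. A small helper to split the two for display (a sketch; it assumes a single well-formed `<think>...</think>` block):

```python
import re

def split_thinking(output):
    """Separate the <think> block from the final answer in raw model output."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.S)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", output.strip()  # no think block found

thinking, answer = split_thinking(
    "<think>Recall the definition.</think>Photosynthesis converts light into chemical energy."
)
```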
## Support
MLX Studio | JANGQ | X @dealignai
Quantized by Jinho Jang (eric@jangq.ai) using JANG Tools v2.4.1.
This model is provided for research and personal use. Users are responsible for ensuring their use complies with applicable laws and the MiniMax license.