
MiniMax-M2.7 JANG_6M

MiniMax M2.7 228B MoE — 6.03-bit mixed precision, 167 GB

Near-lossless quantization for maximum quality on Apple Silicon.

Recommended: Run in MLX Studio for the best experience, including thinking-mode support and optimized MoE inference.

Important Settings

MiniMax M2.7 is an always-reasoning model. It thinks before answering on every prompt.
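Because every response starts with a reasoning block, clients that only want the final answer need to separate the two. A minimal sketch, assuming the reasoning is wrapped in `<think>...</think>` tags as described in the Tool Use section:

```python
import re

def split_thinking(text: str):
    """Separate the <think> reasoning block from the visible answer.

    Assumes reasoning is delimited by <think>...</think>; if no block
    is found, the whole text is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

thought, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
# thought == "2+2 is 4.", answer == "The answer is 4."
```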

Setting Value Notes
Temperature 1.0 REQUIRED — greedy/temp=0 causes infinite thinking loops
Top P 0.95
Top K 40
Repetition Penalty 1.1 Optional, helps prevent loops
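The repetition penalty discourages loops by making already-generated tokens less likely. A pure-Python sketch of the standard scheme (positive logits divided by the penalty, negative ones multiplied, so the adjustment always pushes probability down); this illustrates the idea, not mlx_lm's exact implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalize tokens that already appeared in the output.

    logits:        per-token scores for the next-token distribution.
    generated_ids: token ids produced so far.
    """
    logits = list(logits)
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty   # shrink positive logits
        else:
            logits[tok] *= penalty   # push negative logits further down
    return logits

# token 2 was already generated, so its logit shrinks from 2.0 to 2.0/1.1
new = apply_repetition_penalty([1.0, -0.5, 2.0], [2], penalty=1.1)
```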

Model Details

Metric Value
Source MiniMaxAI/MiniMax-M2.7 (FP8 E4M3)
Architecture MoE (256 experts, top-8 active), GQA (48 heads / 8 KV), partial RoPE
Total Parameters 228.7B
Active Parameters ~1.4B per token
Profile JANG_6M (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=6-bit)
Actual avg bits 6.03
Model size 167 GB
Format JANG v2 (MLX-native safetensors, instant load)
group_size 128 (speed-optimized for 256 experts)
Routing Sigmoid + bias correction (not softmax)
QK-norm Full vector RMSNorm
Context 192K tokens
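The sigmoid-plus-bias routing noted in the table differs from softmax gating: expert affinities pass through a sigmoid, a per-expert bias is added only when choosing the top-8 (for load balance), and the mixture weights are the raw sigmoid scores of the winners, renormalized. A rough sketch of that scheme (names and shapes are assumptions, not MiniMax's actual code):

```python
import math

def route(logits, bias, top_k=8):
    """Sigmoid + bias-corrected top-k routing (sketch).

    logits: per-expert affinity scores for one token.
    bias:   per-expert load-balancing bias, used ONLY for selection.
    Returns (chosen expert indices, normalized mixture weights).
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Bias shifts which experts win, but not their mixing weights.
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights

chosen, weights = route([float(i) for i in range(16)], [0.0] * 16)
```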

JANG_6M Bit Allocation

Tier Components Bits
CRITICAL Attention (Q/K/V/O), lm_head 8
IMPORTANT Embeddings 6
COMPRESS Expert MLP (w1/w2/w3) — 98%+ of params 6
Passthrough MoE router/gate (float16), norms, QK-norms 16

JANG protects the routing and attention paths while compressing the 256 expert MLPs, where MoE models are most tolerant of quantization: attention weights are kept at 8-bit, and the router stays at float16 (no quantization) for maximum routing precision.
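The 6.03-bit average follows directly from the tier split. With illustrative parameter fractions (the ~98% expert share is from the table above; the attention/embedding split is an assumption), the weighted average works out:

```python
# Illustrative parameter fractions per tier (expert share from the
# bit-allocation table; the remaining split is assumed for the example).
tiers = {
    "expert_mlp": (0.980, 6),   # COMPRESS
    "attention":  (0.015, 8),   # CRITICAL
    "embeddings": (0.005, 6),   # IMPORTANT
}
avg_bits = sum(frac * bits for frac, bits in tiers.values())
# → 6.03 with these fractions
```

Real averages also include the per-group scale/bias overhead (group_size 128) and the float16 passthrough tensors, which this sketch ignores.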

MMLU Benchmarks (200q, 10 subjects, reasoning ON)

Coming soon — benchmarks in progress.

Why JANG for MiniMax

Standard MLX quantization of MiniMax produces broken output at every bit level tested (~25% MMLU, i.e., random guessing). JANG's mixed-precision approach is, to our knowledge, the only working quantized MiniMax on Apple Silicon.

On M2.5, JANG_2L achieved 74% MMLU vs MLX's 25% (random). M2.7 results pending.

All Quantizations

Profile (CRITICAL, IMPORTANT, COMPRESS bits) Size Avg Bits
JANG_2L (8, 6, 2) 63 GB 2.10
JANG_3L (8, 4, 3) 89 GB 3.08
JANG_4M (8, 4, 4) 115 GB 4.06
JANG_6M (8, 6, 6) 167 GB 6.03

Requirements

  • Apple Silicon Mac with at least 192 GB unified memory
  • MLX framework
  • MLX Studio recommended

Tool Use / Agent Mode

MiniMax M2.7 uses interleaved thinking + tool calls — it reasons inside <think> blocks, then emits tool calls in <minimax:tool_call> format. Some clients (Opencode, etc.) may strip the <think> block and miss the tool call.

For tool-use clients, set enable_thinking=False in the chat template:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False  # skips <think> injection for tool-use
)

MiniMax tool call format:

<minimax:tool_call>
<invoke name="tool_name">
<parameter name="param1">value1</parameter>
</invoke>
</minimax:tool_call>
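Clients that handle tool calls themselves can extract blocks in this format with a small regex-based parser. A sketch (the parser is illustrative; dispatching the calls is up to the client):

```python
import re

def parse_tool_calls(text: str):
    """Extract <minimax:tool_call> blocks into (name, params) pairs."""
    calls = []
    for block in re.findall(
        r"<minimax:tool_call>(.*?)</minimax:tool_call>", text, re.DOTALL
    ):
        for name, body in re.findall(
            r'<invoke name="([^"]+)">(.*?)</invoke>', block, re.DOTALL
        ):
            params = dict(re.findall(
                r'<parameter name="([^"]+)">(.*?)</parameter>', body, re.DOTALL
            ))
            calls.append((name, params))
    return calls

sample = (
    '<minimax:tool_call>\n'
    '<invoke name="tool_name">\n'
    '<parameter name="param1">value1</parameter>\n'
    '</invoke>\n'
    '</minimax:tool_call>'
)
calls = parse_tool_calls(sample)
```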

Usage

from jang_tools.loader import load_jang_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.7-JANG_6M")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False, add_generation_prompt=True
)
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(output)

Support

MLX Studio | JANGQ | X @dealignai

Quantized by Jinho Jang (eric@jangq.ai) using JANG Tools v2.4.1.


This model is provided for research and personal use. Users are responsible for ensuring their use complies with applicable laws and the MiniMax license.
