MiniMax-M2.7-JANGTQ_K

MiniMax M2.7 — 74 GB on disk (down from ~230 GB FP8 source) — mixed-bit JANGTQ_K quantization in JANGTQ-PRESTACK layout.

  • Source: MiniMaxAI/MiniMax-M2.7 (62 layers, 256 routed experts with top-8 routing, 196K context)
  • Quantization: mixed-bit JANGTQ on routed experts (see the sketch after this list):
    • down_proj: 4-bit (output enters residual stream, more sensitive)
    • gate_proj: 2-bit (gated activation, less sensitive)
    • up_proj: 2-bit (gated activation)
    • attention / shared expert / embed / lm_head: 8-bit affine
    • norms / router gate / expert_bias: fp16 / fp32 passthrough
  • Routed-expert layout: pre-stacked along axis 0 per the JANGTQ-PRESTACK standard: instant cold load, no runtime sidecar.
  • Bundle size: ~74 GB on-disk (~3-bit avg routed)
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
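
The bit assignment above, expressed as a pattern-to-bits map. This is an illustrative sketch only; the key names and layout are assumptions, not the actual JANGTQ config schema.

# Illustrative sketch: per-tensor bit widths mirroring the list above.
# Key names are hypothetical, not the shipped JANGTQ config format.
BIT_MAP = {
    "experts.*.down_proj": 4,   # output enters the residual stream
    "experts.*.gate_proj": 2,   # dampened by the SwiGLU gate
    "experts.*.up_proj":   2,   # dampened by the SwiGLU gate
    "self_attn.*":         8,   # 8-bit affine
    "shared_expert.*":     8,
    "embed_tokens":        8,
    "lm_head":             8,
}
# norms, router gate, and expert_bias stay fp16/fp32 (no quantization)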

Why mixed-bit?

down_proj's output enters the residual stream and accumulates across 62 layers — quantization noise compounds. gate_proj and up_proj enter through SwiGLU's multiplicative gate (silu(gate) × up) which dampens noise. Spending 4 bits on down and 2 bits on gate/up gives quality close to full-4-bit (~115 GB) at 64% the size.
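
A back-of-envelope check on those numbers (in SwiGLU, gate_proj and up_proj map d_model → d_ff and down_proj maps d_ff → d_model, so all three projections hold the same parameter count and a plain mean applies):

# Average routed bits is a simple mean over the three equal-sized projections.
bits = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}
print(sum(bits.values()) / len(bits))   # 2.67 -> the "~3-bit avg routed"
print(74 / 115)                         # ~0.64 -> "64% the size" of full 4-bit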

Variants in the MiniMax-M2.7 line

Variant                         Routed bits (avg)    Bundle size   Use case
MiniMax-M2.7-JANGTQ             2-bit                47 GB         smallest, best for tight RAM
MiniMax-M2.7-JANGTQ_K (this)    ~3-bit (mixed 2/4)   74 GB         quality close to 4-bit at 2-bit-ish size

Loading

pip install jang-tools mlx-lm

from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-JANGTQ_K")
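
For generation, assuming the model and tokenizer returned by load_jangtq_model are mlx-lm compatible (plausible given the mlx-lm dependency, but an assumption, not something this card states):

from mlx_lm import generate
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-JANGTQ_K")

# Route the prompt through the chat template so the <think> handling
# described below applies.
messages = [{"role": "user", "content": "Summarize mixture-of-experts routing."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

print(generate(model, tokenizer, prompt=prompt, max_tokens=512))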

Reasoning + tools

  • Default: thinking ON (the chat template inserts <think>\n after the assistant prefix)
  • Disable reasoning:
    messages = [{"role": "user", "content": "..."}]
    inp = tokenizer.apply_chat_template(messages, add_generation_prompt=True, enable_thinking=False)
    
  • Reasoning parser: qwen3 (extracts <think>...</think> blocks; see the sketch after this list)
  • Tool parser: minimax
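
If you are not running behind an engine with a built-in parser, a minimal stand-in for the qwen3-style reasoning split looks like the following (real parsers also handle streaming and unclosed tags, which this sketch does not):

import re

def split_reasoning(output: str) -> tuple[str, str]:
    # Separate the <think>...</think> block from the visible answer.
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if m is None:
        return "", output.strip()
    return m.group(1).strip(), output[m.end():].strip()

reasoning, answer = split_reasoning("<think>check units...</think>It's 42.")
print(answer)   # "It's 42."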

The chat template ships with the enable_thinking switch correctly wired in both places: the standalone chat_template.jinja and the inline copy in tokenizer_config.json["chat_template"], for engines that only read the inline field (vMLX, swift-transformers).
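
A quick sanity check that the two copies stay in sync after any local edits (the snapshot path below is a placeholder):

import json
from pathlib import Path

repo = Path("path/to/MiniMax-M2.7-JANGTQ_K")   # local snapshot (placeholder)
standalone = (repo / "chat_template.jinja").read_text()
inline = json.loads((repo / "tokenizer_config.json").read_text())["chat_template"]
assert standalone == inline, "chat templates diverged"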

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Base model: MiniMaxAI — M2.7 architecture