Instructions to use OsaurusAI/MiniMax-M2.7-JANGTQ4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OsaurusAI/MiniMax-M2.7-JANGTQ4 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/MiniMax-M2.7-JANGTQ4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use OsaurusAI/MiniMax-M2.7-JANGTQ4 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/MiniMax-M2.7-JANGTQ4"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OsaurusAI/MiniMax-M2.7-JANGTQ4"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use OsaurusAI/MiniMax-M2.7-JANGTQ4 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/MiniMax-M2.7-JANGTQ4"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/MiniMax-M2.7-JANGTQ4

Run Hermes

hermes

MLX LM

How to use OsaurusAI/MiniMax-M2.7-JANGTQ4 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/MiniMax-M2.7-JANGTQ4"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "OsaurusAI/MiniMax-M2.7-JANGTQ4"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "OsaurusAI/MiniMax-M2.7-JANGTQ4",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'


---
language:
- en
- zh
library_name: mlx
license: mit
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- mlx
- jang
- jangtq
- jangtq4
- minimax
- minimax_m2
- moe
- apple-silicon
- 4bit
- turboquant
---
> ## ⚠️ REQUIRED — `jangtq_runtime.safetensors` sidecar must be downloaded
>
> Osaurus uses the native Swift JANGTQ runtime. **Every JANGTQ bundle on
> OsaurusAI ships a small `jangtq_runtime.safetensors` sidecar (~10 KB–~165 KB)
> alongside the weight shards.** The Swift loader will refuse to start with
> the error
> ```
> Error: Model '<name>' declares JANGTQ (weight_format: "mxtq") but is
>        missing required sidecar file 'jangtq_runtime.safetensors'.
>        Re-download the full model or obtain the sidecar from the original
>        publisher.
> ```
> if the file is absent.
>
> If your local copy lacks it (older download, partial sync, etc.):
> ```bash
> hf download OsaurusAI/MiniMax-M2.7-JANGTQ4 jangtq_runtime.safetensors --local-dir <your-dir>
> ```
> The file holds the deterministic codebooks + Hadamard rotation signs the
> Swift loader uses to decode `*.tq_packed` weights. It must match the seed
> the bundle was quantized with (`mxtq_seed=42`).


<p align="center">
  <a href="https://osaurus.ai"><img src="./osaurus-x-banner.png" alt="Osaurus AI"></a>
</p>

<h3 align="center">MiniMax M2.7 &mdash; JANGTQ4 (MLX)</h3>
<p align="center">TurboQuant codebook quantization of MiniMax's 228B agentic MoE &mdash; routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine. Near-bf16 quality at ~25% of bf16 disk.</p>

<p align="center">
  <a href="https://osaurus.ai"><img src="https://img.shields.io/badge/Web-osaurus.ai-blue" alt="Website"></a>&nbsp;
  <a href="https://huggingface.co/OsaurusAI"><img src="https://img.shields.io/badge/HF-OsaurusAI-yellow?logo=huggingface" alt="OsaurusAI"></a>
</p>

---

## Model Details

| Property | Value |
|---|---|
| **Base Model** | MiniMaxAI/MiniMax-M2.7 |
| **Architecture** | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| **Total Parameters** | 228.7 B |
| **Active per Token** | ~10 B |
| **Profile** | JANGTQ4 |
| **Format** | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| **Avg bits/param** | ~4.10 |
| **Codebook size** | 16 entries (4-bit) |
| **Disk** | ~113 GB |
| **Context length** | 192 K tokens |
| **Chat template** | Always-reasoning (`<think>` opened at assistant start) |

## What is JANGTQ4?

**JANGTQ** (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact **codebook +
Hadamard-rotated** form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed `uint32` weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# *rotate-input-once* math).
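
The rotate-input-once identity can be checked numerically. The sketch below is a pure-Python illustration, not the JANGTQ Metal kernel: it assumes the rotation is an orthonormal Hadamard transform applied after a random sign flip, and the helper names (`fwht`, `rotate`, `dot`) and the choice of seed are illustrative only.

```python
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [x * scale for x in out]


def rotate(v, signs):
    """Apply R = H * diag(signs): flip signs, then Hadamard-transform."""
    return fwht([s * x for s, x in zip(signs, v)])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(42)  # illustrative seed, not the bundle's sign derivation
n = 8
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
w_row = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one weight row
x = [rng.gauss(0.0, 1.0) for _ in range(n)]      # one input vector

w_rot = rotate(w_row, signs)  # done offline, before quantization
x_rot = rotate(x, signs)      # done once per input at runtime

# R is orthogonal, so <R w, R x> = <w, x>: the dot product can be taken
# in the rotated basis without un-rotating the stored weights.
assert abs(dot(w_rot, x_rot) - dot(w_row, x)) < 1e-9
```

Because the input is rotated once and reused for every weight row, the per-row work reduces to codebook lookups and accumulation.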

**JANGTQ4** uses a 16-entry Lloyd-Max codebook per routed expert tensor, which
captures the weight distribution near-losslessly. Quality approaches bf16 at
~25% of bf16 disk and runs at the full JANGTQ decode speed. Pick this profile
when RAM permits and you want the closest quality to bf16 on Apple Silicon;
pick JANGTQ (2-bit) for the smallest footprint.
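
To make the 16-entry codebook concrete, here is a minimal sketch of 1-D Lloyd-Max fitting followed by 4-bit index packing. It is an assumption-laden illustration rather than the JANGTQ quantizer: the initialization, the iteration count, and the lowest-nibble-first `uint32` layout are all choices made for this example, and every function name is hypothetical.

```python
import random


def lloyd_max(values, k=16, iters=25):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update."""
    lo, hi = min(values), max(values)
    # Start from k evenly spaced centroids across the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[idx].append(v)
        # Empty buckets keep their previous centroid.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


def quantize(values, centroids):
    """Map each value to the index of its nearest centroid (4 bits when k=16)."""
    k = len(centroids)
    return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]


def pack_u32(indices):
    """Pack eight 4-bit indices into each 32-bit word, lowest nibble first."""
    words = []
    for i in range(0, len(indices), 8):
        word = 0
        for j, idx in enumerate(indices[i:i + 8]):
            word |= (idx & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_u32(words, count):
    """Inverse of pack_u32: recover `count` 4-bit indices."""
    return [(words[i // 8] >> (4 * (i % 8))) & 0xF for i in range(count)]


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(2048)]  # stand-in for one tensor
codebook = lloyd_max(weights)
indices = quantize(weights, codebook)
packed = pack_u32(indices)
decoded = [codebook[i] for i in unpack_u32(packed, len(weights))]
mse = sum((w - d) ** 2 for w, d in zip(weights, decoded)) / len(weights)
```

The stored artifacts are the packed words plus the 16 centroids; decoding is a table lookup per weight.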

## JANGTQ vs JANGTQ4 vs bf16

| | JANGTQ (2-bit) | **JANGTQ4** | bf16 |
|---|---|---|---|
| Disk | ~57 GB | **~113 GB** | ~457 GB |
| Routed expert bits | 2 | **4** | 16 |
| Codebook size | 4 entries | **16 entries** | — |
| Avg bits/param | ~2.15 | **~4.10** | 16 |

## Bit Allocation

| Component | Bits | Format |
|---|:---:|---|
| Routed expert MLP (gate / up / down) | **4** | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (`nn.QuantizedLinear`, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized `nn.Linear` |
| RMSNorms / RoPE / biases | fp16 | Unquantized |

The routed experts hold about 98 % of the parameters, which makes them the
natural compression target. Everything else stays at 8-bit affine or
unquantized fp16, so the quality-critical path (attention, router, shared
expert, embeddings) keeps higher precision.
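
The average bits/param figures can be roughly reproduced from this allocation. The sketch below is a back-of-the-envelope estimate, not the exact accounting used for this bundle: it assumes the affine layers carry one fp16 scale and one fp16 bias per group of 64 weights, and it ignores codebooks and the fp16 layers. The function name is hypothetical.

```python
def average_bits(routed_fraction, routed_bits,
                 affine_bits=8.0, group_size=64, group_overhead_bits=32.0):
    """Weighted average bits/param across routed experts and affine layers."""
    # Assumption: each affine group stores an fp16 scale and an fp16 bias,
    # which adds 32 / group_size bits to every affine-quantized weight.
    affine_effective = affine_bits + group_overhead_bits / group_size
    return (routed_fraction * routed_bits
            + (1.0 - routed_fraction) * affine_effective)


jangtq4 = average_bits(0.98, 4)  # routed experts at 4-bit
jangtq2 = average_bits(0.98, 2)  # routed experts at 2-bit
```

Under these assumptions the estimates come out at about 4.09 and 2.13 bits/param, close to the ~4.10 and ~2.15 listed in the comparison table.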

## Important Settings

MiniMax M2.7 is an **always-reasoning** model. The chat template
unconditionally opens `<think>` at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | **1.0** | Required — `temp=0` can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| `max_tokens` | ≥ 8192 | Give reasoning room to converge |

Strip `<think>…</think>` from the response before using the final answer.
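
A minimal sketch of that stripping step, assuming the raw text either contains complete `<think>` blocks or, because the template already opened the tag, only the closing `</think>`. The helper name `strip_think` is hypothetical and not part of any loader.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Return only the final answer, dropping any reasoning span."""
    # Case 1: the output contains complete <think>...</think> blocks.
    text = THINK_BLOCK.sub("", text)
    # Case 2: the template already opened <think>, so the generated text
    # holds only the closing tag; keep what follows the last one.
    if "</think>" in text:
        text = text.rsplit("</think>", 1)[1]
    return text.strip()
```

Output that was cut off before the closing tag is returned unchanged, which usually signals that `max_tokens` was too small for the reasoning to finish.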

## Usage

This model requires the `jang-tools` loader — stock `mlx_lm.load()` does not
recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).

```bash
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than a `temperature` keyword on generate().
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
out = generate(model, tokenizer, prompt, max_tokens=8192,
               sampler=sampler, verbose=True)
```

### Swift — Osaurus / MLX Studio

Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.

## What's In This Repo

| File | Role |
|---|---|
| `model-*.safetensors` (117 shards, ~113 GB) | Weights — 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 `capabilities` stamp (`reasoning=qwen3`, `tool=minimax`) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=4`) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png` | Branding asset |
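
A pre-flight check can catch a partial download before a loader does. The sketch below is a hypothetical helper, not part of Osaurus or `jang-tools`; it assumes only the file names listed in the table above.

```python
from pathlib import Path

# File names taken from this model card; the helper itself is hypothetical.
SIDECAR = "jangtq_runtime.safetensors"
REQUIRED = (SIDECAR, "jang_config.json", "config.json")


def missing_jangtq_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

An empty list means the metadata and sidecar are in place; the weight shards still need to be checked against `model.safetensors.index.json`.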

## Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)

```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```

`<think>` and `<tool_call>` are non-special tokens by design — the
application layer parses them. Osaurus and `vmlx` `CapabilityDetector` read
this block verbatim and wire the `qwen3` reasoning parser + `minimax` tool
parser automatically, so streamed responses route `reasoning_content` and
`tool_calls` into the OpenAI-compatible SSE fields instead of leaking into
`content`.
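
For clients that do not ship a capability detector, the same block can be read by hand. This is a minimal sketch under one loud assumption: that the JSON above sits under a top-level `capabilities` key in `jang_config.json`. The function names are hypothetical and do not mirror the Osaurus or `vmlx` implementation.

```python
import json
from pathlib import Path


def read_capabilities(model_dir):
    """Return the `capabilities` block from jang_config.json, or {} if absent."""
    # Assumption: the block is stored under a top-level "capabilities" key.
    config = json.loads((Path(model_dir) / "jang_config.json").read_text())
    return config.get("capabilities", {})


def parser_names(caps):
    """Pick the reasoning and tool parser names, or None when unsupported."""
    reasoning = caps.get("reasoning_parser") if caps.get("supports_thinking") else None
    tool = caps.get("tool_parser") if caps.get("supports_tools") else None
    return reasoning, tool
```

With the block shown above, `parser_names` would yield the `qwen3` and `minimax` parser names, which an application can then map to its own parser implementations.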

## License

MIT — see [`LICENSE`](./LICENSE).

## Credits

Created by [Jinho Jang](https://twitter.com/jangq_ai) — `eric@jangq.ai`

Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.