---
license: other
license_name: minimax-m2.7-non-commercial
license_link: LICENSE
library_name: mlx
tags:
- mlx
- osaurus
- jangtq
- jangtq-prestack
- jangtq-k
- mixed-precision
- minimax
- minimax-m2
- moe
- apple-silicon
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# MiniMax-M2.7-JANGTQ_K

**MiniMax M2.7 — 74 GB on disk** (down from ~230 GB FP8 source) — **mixed-bit JANGTQ_K** quantization in JANGTQ-PRESTACK layout.

- **Source:** [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI) (62 layers, 256 routed experts top-8, 196K context)
- **Quantization:** **mixed-bit JANGTQ_K** on routed experts (sketched in code after this list):
  - `down_proj`: **4-bit** (output enters the residual stream, more sensitive)
  - `gate_proj`: **2-bit** (gated activation, less sensitive)
  - `up_proj`: **2-bit** (gated activation)
  - attention / shared expert / embed / lm_head: 8-bit affine
  - norms / router gate / expert_bias: fp16 / fp32 passthrough
- **Routed-expert layout:** **pre-stacked along axis 0** per the JANGTQ-PRESTACK standard — instant cold load, no runtime sidecar.
- **Bundle size:** **~74 GB on disk** (~3-bit avg routed)
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio

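
A minimal sketch of the bit assignment listed above, assuming Hugging Face-style module paths (`experts`, `down_proj`, `self_attn`, ...); the path names are illustrative and the actual JANGTQ_K tooling may differ:

```python
def bits_for(path: str) -> int | None:
    """Map a weight path to its quantization bit-width; None means fp16/fp32 passthrough."""
    if any(k in path for k in ("self_attn", "shared_expert", "embed_tokens", "lm_head")):
        return 8                                      # 8-bit affine
    if "experts" in path:                             # routed experts
        if "down_proj" in path:
            return 4                                  # output feeds the residual stream (sensitive)
        if "gate_proj" in path or "up_proj" in path:
            return 2                                  # SwiGLU gate dampens quantization noise
    return None                                       # norms / router gate / expert_bias
```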
## Why mixed-bit?

`down_proj`'s output enters the residual stream and accumulates across 62 layers — quantization noise compounds. `gate_proj` and `up_proj` enter through SwiGLU's multiplicative gate (`silu(gate) × up`), which dampens noise. Spending 4 bits on `down` and 2 bits on `gate`/`up` gives quality close to full 4-bit (~115 GB) at **64% of the size**.

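
A rough sanity check on those numbers, assuming the three expert projections hold roughly equal parameter counts (as in a standard SwiGLU MLP):

```python
avg_bits = (4 + 2 + 2) / 3   # down=4, gate=2, up=2 -> ~2.67 bits/weight on routed experts
print(avg_bits)              # 2.666... -> the "~3-bit avg routed" figure above
print(74 / 115)              # ~0.64   -> this bundle vs. the full 4-bit bundle
```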
## Variants in the MiniMax-M2.7 line

| Variant | Routed bits (avg) | Bundle size | Use case |
|---|---|---|---|
| `MiniMax-M2.7-JANGTQ` | 2-bit | 47 GB | smallest, best for tight RAM |
| **`MiniMax-M2.7-JANGTQ_K` (this)** | **~3-bit (mixed 2/4)** | **74 GB** | **quality close to 4-bit at ~64% of the 4-bit size** |

## Loading

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-JANGTQ_K")
```

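
A quick generation example, assuming the jang_tools-loaded model and tokenizer are drop-in compatible with `mlx_lm.generate`; if they are not, use whatever generation helper `jang-tools` provides:

```python
from mlx_lm import generate

messages = [{"role": "user", "content": "Summarize mixture-of-experts routing in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```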
## Reasoning + tools

- **Default:** thinking ON (chat template inserts `<think>\n` after the assistant prefix)
- **Disable reasoning:**
  ```python
  messages = [{"role": "user", "content": "..."}]
  inp = tokenizer.apply_chat_template(messages, add_generation_prompt=True, enable_thinking=False)
  ```
- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks; see the sketch after this list)
- **Tool parser:** `minimax`

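
If you are not running behind a server that applies the `qwen3` reasoning parser for you, a minimal regex-based sketch of the same split (illustrative only, not the parser's actual implementation):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> block from the visible answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    thinking = m.group(1).strip()
    answer = (text[: m.start()] + text[m.end():]).strip()
    return thinking, answer
```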
The chat template ships with the `enable_thinking` switch correctly wired both as a standalone `chat_template.jinja` and inlined into `tokenizer_config.json["chat_template"]` for engines that read the inline template (vMLX, swift-transformers).

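
To confirm the two copies agree in a local snapshot of this repo (the path below is illustrative):

```python
import json
from pathlib import Path

repo = Path("MiniMax-M2.7-JANGTQ_K")  # adjust to your local snapshot directory
standalone = (repo / "chat_template.jinja").read_text()
inline = json.loads((repo / "tokenizer_config.json").read_text())["chat_template"]
assert standalone == inline, "standalone and inline chat templates differ"
```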
## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** MiniMaxAI — M2.7 architecture