---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source),
quantized with **stock 4-bit affine** on inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class; no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention,
  256 experts with top-8 routing, MTP head, 131K context)
- **Quantization:** MXFP4. Every weight (routed experts, attention, shared
  experts, dense MLP, embed, lm_head) is quantized at **4-bit affine,
  group_size=32** (see the sketch after this list); norms, router gates,
  expert biases, and slopes stay fp16/fp32 passthrough.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
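The 4-bit affine, group_size=32 setting corresponds to MLX's built-in group
quantization. Below is a minimal, illustrative sketch using `mx.quantize` /
`mx.dequantize` on a toy matrix; it is not the conversion pipeline that
produced this bundle, just a way to see what the setting means.

```python
import mlx.core as mx

# Toy stand-in for one weight matrix; in the real model these are the
# routed-expert, attention, and MLP weights listed above.
w = mx.random.normal(shape=(4096, 1024)).astype(mx.float16)

# 4-bit affine quantization with one scale/bias per group of 32 values,
# matching the group_size=32 setting on this card.
w_q, scales, biases = mx.quantize(w, group_size=32, bits=4)

# Round-trip to eyeball the quantization error on this matrix.
w_hat = mx.dequantize(w_q, scales, biases, group_size=32, bits=4)
print("mean abs error:", mx.abs(w - w_hat).mean().item())
```

The passthrough tensors (norms, router gates, expert biases, slopes) simply
skip this step and are stored in fp16/fp32.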
## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 trades a cheap-but-slow MXTQ codec on the routed experts for a tighter
overall bit budget. MXFP4 is the simpler "just works with stock MLX" option
for users who don't want the TurboQuant runtime in their stack.
## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full softmax MLA, the other 28 of 32 are
Lightning-Linear-Attention, plus a Multi-Token Prediction head. The schedule
is sketched in code after the table.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
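A hypothetical reconstruction of that schedule, derived from the table above
rather than read from the model config:

```python
# Layers 7, 15, 23, 31 (every 8th) use full softmax MLA; everything else uses
# Lightning linear attention. Only layer 0 keeps a dense MLP; the rest are MoE.
N_LAYERS = 32

for i in range(N_LAYERS):
    attn = "MLA (softmax)" if i % 8 == 7 else "Lightning (linear)"
    mlp = "dense MLP" if i == 0 else "MoE (256 routed + 1 shared, top-8)"
    print(f"layer {i:2d}: {attn:18} | {mlp}")
```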
See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.
## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```
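From there, generation is the usual `mlx_lm` call. A minimal sketch, assuming
the chat template shipped in the bundle's tokenizer config; the prompt text
and `max_tokens` value are arbitrary placeholders:

```python
# Build a chat-formatted prompt and generate with the model loaded above.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain MXFP4 quantization in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```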
Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.
## Reasoning + tools

The default is **`detailed thinking off`**. To enable thinking, prepend a
system message:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

When thinking is on, the model emits `<think>...</think>` reasoning blocks
before its answers. Tool calls use the DeepSeek-style format.
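Downstream code usually wants the final answer without the reasoning block.
A small, hypothetical post-processing helper (not part of `mlx_lm` or this
bundle) that splits the two:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate a leading <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_thinking("<think>check the units first</think> 42 km")
print(answer)  # -> "42 km"
```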
## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first inference for open-weight LLMs.