---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source): **stock 4-bit affine** quantization of inclusionAI's Bailing-V2.5 hybrid architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model class; no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- **Quantization:** MXFP4: every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at **4-bit affine, group_size=32**. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough. A size breakdown and reproduction sketch follow this list.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
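
The size follows from the bit budget: assuming an fp16 scale and bias per group, each group of 32 4-bit weights costs 16 + 4 = 20 bytes, about 0.625 bytes per parameter, so ~103B parameters land near 64 GB, consistent with the 63 GB bundle. As a minimal, hypothetical reproduction sketch using stock `mlx_lm` (not necessarily the exact command used to produce this bundle):

```python
# Hypothetical sketch, not the exact command used for this bundle.
# Assumes the bailing_hybrid model class is visible to mlx_lm.
from mlx_lm import convert

convert(
    "inclusionAI/Ling-2.6-flash",     # 200 GB bf16 source
    mlx_path="Ling-2.6-flash-MXFP4",  # output directory
    quantize=True,
    q_bits=4,                         # 4-bit affine weights
    q_group_size=32,                  # one scale/bias per 32 weights
)
```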

## Why two variants?

|  | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack. Concretely, the loader split looks like the sketch below.
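
The `jang_tools` call signature here is an assumption by analogy with `mlx_lm.load`; consult the JANGTQ card for the real invocation:

```python
# MXFP4: stock MLX path, no extra runtime.
from mlx_lm import load
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")

# JANGTQ2: TurboQuant kernel path (assumed signature, see the JANGTQ card).
# from jang_tools import load_jangtq
# model, tokenizer = load_jangtq("JANGQ-AI/Ling-2.6-flash-JANGTQ")
```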

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full-softmax MLA; the other 28 of 32 are Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
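
The schedule in the table reduces to a one-line rule; a minimal illustrative check (the helper name is ours, not from the bundle):

```python
# Illustrative only: the MLA / linear-attention split from the table above.
def layer_kind(i: int) -> str:
    # Every 8th layer (indices 7, 15, 23, 31) is full-softmax MLA.
    return "mla" if i % 8 == 7 else "linear"

schedule = [layer_kind(i) for i in range(32)]
assert [i for i, k in enumerate(schedule) if k == "mla"] == [7, 15, 23, 31]
assert schedule.count("linear") == 28  # 1 dense-MLP layer + 27 MoE layers
```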

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ) for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```
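
Generation then follows the standard `mlx_lm` API; a short usage sketch (the prompt text is illustrative):

```python
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}],
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```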

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is present (shipped with `jang-tools >= TBD`). The bundle's `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` provide HF compatibility for tooling that goes through transformers.
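
A quick way to confirm the model class is visible before loading (a plain standard-library check, nothing bundle-specific):

```python
import importlib.util

# mlx_lm resolves architectures from the "model_type" in config.json;
# this just confirms the bailing_hybrid module is importable.
assert importlib.util.find_spec("mlx_lm.models.bailing_hybrid") is not None
```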

## Reasoning + tools

Default is **`detailed thinking off`**. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before the answer when thinking is on. Tool calls follow the DeepSeek-style format.
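
Putting it together, a hedged end-to-end sketch: enable thinking, generate, and strip the reasoning block (the regex assumes the `<think>...</think>` format described above):

```python
import re

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "What is 17 * 24?"},
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
raw = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Keep only the final answer after the <think>...</think> block.
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(answer)
```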

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first inference for open-weight LLMs.