---
license: mit
tags:
  - moe
  - mixture-of-experts
  - hybrid-attention
  - mla
  - lightning-attention
  - mxfp4
  - osaurus
  - mlx
  - bailing
  - ling
  - apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source):
**stock 4-bit affine** quantization of inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class; no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA and
  Lightning-Linear-Attention, 256 experts with top-8 routing, an MTP head,
  131K context)
- **Quantization:** MXFP4, with every weight (routed experts, attention,
  shared experts, dense MLP, embed, lm_head) at **4-bit affine,
  group_size=32**. Norms, router gates, expert biases, and slopes stay
  fp16/fp32 passthrough.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
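
As an illustration of the 4-bit affine scheme described above, here is a
minimal sketch of quantizing one `group_size=32` block of weights. This is
not the MLX kernel (which packs values and stores scales/biases as arrays);
it only shows the arithmetic of per-group affine quantization:

```python
def quantize_group_4bit(weights):
    """Affine-quantize one group of weights to 4-bit codes (0..15)
    with a per-group scale and bias (zero-point)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0          # 4 bits -> 16 levels
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                    # int codes + fp metadata

def dequantize_group(q, scale, bias):
    return [v * scale + bias for v in q]

# One group_size=32 block stores 32 x 4-bit codes plus one scale and
# one bias, versus 32 x 16-bit floats in the bf16 source.
group = [0.01 * i - 0.15 for i in range(32)]
q, scale, bias = quantize_group_4bit(group)
approx = dequantize_group(q, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, approx))  # <= scale / 2
```

The per-group metadata is why the bundle is ~63 GB rather than exactly a
quarter of the bf16 size.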

## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 trades decode speed for a tighter overall bit budget: its MXTQ
codec on the routed experts is cheap in bits but slower to decode. MXFP4 is
the simpler option that just works with stock MLX, for users who don't want
the TurboQuant runtime in their stack.

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full-softmax MLA; the other 28 of 32
use Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on
top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.
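
The schedule in the table reduces to a simple rule: layer indices congruent
to 7 mod 8 get full-softmax MLA, everything else (including layer 0) gets
linear attention. A tiny sketch of that rule (`attention_kind` is a
hypothetical helper, not part of the loader):

```python
def attention_kind(layer_idx: int) -> str:
    """Every 8th layer (indices 7, 15, 23, 31) uses full-softmax MLA;
    the remaining 28 layers use Lightning linear attention."""
    return "mla" if layer_idx % 8 == 7 else "linear"

mla_layers = [i for i in range(32) if attention_kind(i) == "mla"]
# -> [7, 15, 23, 31], matching the table above
```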

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.

## Reasoning + tools

Default is **`detailed thinking off`**. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before its answer
when thinking is on. Tool calls use the DeepSeek-style format.
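
Downstream code usually wants the answer without the reasoning. A minimal
sketch of splitting the two, assuming only the `<think>...</think>` wrapper
described above (the helper name is illustrative, not part of any API):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion that may start
    with a <think>...</think> block; reasoning is "" if absent."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", text, count=1).strip()
    return reasoning, answer

out = "<think>2+2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(out)
# reasoning == "2+2 is 4.", answer == "The answer is 4."
```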

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first
  inference for open-weight LLMs.