---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-MXFP4
**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source):
**stock 4-bit affine** quantization of inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class; no TurboQuant runtime, no sidecar required.
- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
(Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA and
Lightning-Linear-Attention, 256 experts with top-8 routing, MTP head, 131K context)
- **Quantization:** MXFP4 — every weight (routed experts, attention,
shared experts, dense MLP, embed, lm_head) at **4-bit affine,
group_size=32**. Norms, router gates, expert biases, and slopes stay
fp16/fp32 passthrough. See the conversion sketch after this list.
- **Bundle size:** **63 GB on-disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
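
For reference, a plain 4-bit, group-size-32 affine conversion of the bf16
source can be sketched with mlx-lm's `convert` API. This is an illustration
(assuming a recent `mlx-lm` with the `bailing_hybrid` class available), not
necessarily the exact pipeline used to build this bundle; the output path is
a placeholder.
```python
# Sketch only: 4-bit affine, group_size=32 conversion of the bf16 source.
from mlx_lm import convert

convert(
    "inclusionAI/Ling-2.6-flash",       # bf16 source repo
    mlx_path="ling-2.6-flash-mxfp4",    # placeholder output directory
    quantize=True,
    q_bits=4,
    q_group_size=32,
)
# Which tensors stay fp16/fp32 (norms, router gates, expert biases, slopes)
# is decided by the model class's quantization predicate, not by these args.
```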
## Why two variants?
| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts the cheap-but-slow MXTQ codec on the routed experts in
exchange for a tighter overall bit budget. MXFP4 is the simpler
"just works with stock MLX" option for users who don't want the TurboQuant
runtime in their stack.
## Architecture (`bailing_hybrid`)
Hybrid attention: every 8th layer uses full-softmax MLA, while the other 28 of
32 layers use Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head
sits on top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.
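
As a sanity check on the table above, the 1-in-8 full-attention pattern can be
written out directly; the constant and function names below are illustrative
and are not config keys from the bundle.
```python
# Illustrative helper matching the table above: layers 7, 15, 23 and 31 use
# full-softmax MLA; everything else (including layer 0) is linear attention.
NUM_LAYERS = 32  # decoder layers, excluding the MTP head

def attention_kind(layer_idx: int) -> str:
    return "mla" if layer_idx % 8 == 7 else "linear"

print([i for i in range(NUM_LAYERS) if attention_kind(i) == "mla"])  # [7, 15, 23, 31]
```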
## Loading (Python)
```bash
pip install mlx-lm jang-tools
```
```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```
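The imported `generate` then works the usual mlx-lm way; the prompt and
`max_tokens` below are illustrative values, not recommendations.
```python
messages = [{"role": "user", "content": "Summarize the MXFP4 layout of this model."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# max_tokens is an arbitrary example value.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```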
Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.
## Reasoning + tools
Default is **`detailed thinking off`**. To enable:
```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```
When thinking is on, the model emits `...` reasoning blocks before its
answers. Tool calls use the DeepSeek-style format.
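
Reusing `model`, `tokenizer`, and `generate` from the Loading section, the
toggle can be wired into generation like this (prompt and `max_tokens` are
placeholders):
```python
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Why does group_size=32 matter for 4-bit affine quantization?"},
    ],
    add_generation_prompt=True,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
```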
## Credits
- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI) — Ant
Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai) — Apple-Silicon-first
inference for open-weight LLMs.