---
license: mit
tags:
  - moe
  - mixture-of-experts
  - hybrid-attention
  - mla
  - lightning-attention
  - mxfp4
  - osaurus
  - mlx
  - bailing
  - ling
  - apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source):
**stock 4-bit affine** quantization of inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class; no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA and
  Lightning-Linear-Attention, 256 experts with top-8 routing, an MTP head,
  131K context)
- **Quantization:** MXFP4, with every weight (routed experts, attention,
  shared experts, dense MLP, embed, lm_head) at **4-bit affine,
  group_size=32**. Norms, router gates, expert biases, and slopes stay
  fp16/fp32 passthrough.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
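
As an illustration of the 4-bit affine scheme described above, here is a
minimal sketch of quantizing one `group_size=32` block of weights. This is
not the MLX kernel (which packs values and stores scales/biases as arrays);
it only shows the arithmetic of per-group affine quantization:

```python
def quantize_group_4bit(weights):
    """Affine-quantize one group of weights to 4-bit codes (0..15)
    with a per-group scale and bias (zero-point)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0          # 4 bits -> 16 levels
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                    # int codes + fp metadata

def dequantize_group(q, scale, bias):
    return [v * scale + bias for v in q]

# One group_size=32 block stores 32 x 4-bit codes plus one scale and
# one bias, versus 32 x 16-bit floats in the bf16 source.
group = [0.01 * i - 0.15 for i in range(32)]
q, scale, bias = quantize_group_4bit(group)
approx = dequantize_group(q, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, approx))  # <= scale / 2
```

The per-group metadata is why the bundle is ~63 GB rather than exactly a
quarter of the bf16 size.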

## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 trades decode speed for a tighter overall bit budget: its MXTQ
codec on the routed experts is cheap in bits but slower to decode. MXFP4 is
the simpler option that just works with stock MLX, for users who don't want
the TurboQuant runtime in their stack.

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full-softmax MLA; the other 28 of 32
use Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on
top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.
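
The schedule in the table reduces to a simple rule: layer indices congruent
to 7 mod 8 get full-softmax MLA, everything else (including layer 0) gets
linear attention. A tiny sketch of that rule (`attention_kind` is a
hypothetical helper, not part of the loader):

```python
def attention_kind(layer_idx: int) -> str:
    """Every 8th layer (indices 7, 15, 23, 31) uses full-softmax MLA;
    the remaining 28 layers use Lightning linear attention."""
    return "mla" if layer_idx % 8 == 7 else "linear"

mla_layers = [i for i in range(32) if attention_kind(i) == "mla"]
# -> [7, 15, 23, 31], matching the table above
```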

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.

## Reasoning + tools

Default is **`detailed thinking off`**. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before its answer
when thinking is on. Tool calls use the DeepSeek-style format.
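
Downstream code usually wants the answer without the reasoning. A minimal
sketch of splitting the two, assuming only the `<think>...</think>` wrapper
described above (the helper name is illustrative, not part of any API):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion that may start
    with a <think>...</think> block; reasoning is "" if absent."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", text, count=1).strip()
    return reasoning, answer

out = "<think>2+2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(out)
# reasoning == "2+2 is 4.", answer == "The answer is 4."
```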

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first
  inference for open-weight LLMs.