---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source): **stock 4-bit affine** quantization of inclusionAI's Bailing-V2.5 hybrid architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model class; no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- **Quantization:** MXFP4: every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at **4-bit affine, group_size=32**. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough. A size breakdown and reproduction sketch follow this list.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
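
The size follows from the bit budget: assuming an fp16 scale and bias per group, each group of 32 4-bit weights costs 16 + 4 = 20 bytes, about 0.625 bytes per parameter, so ~103B parameters land near 64 GB, consistent with the 63 GB bundle. As a minimal, hypothetical reproduction sketch using stock `mlx_lm` (not necessarily the exact command used to produce this bundle):

```python
# Hypothetical sketch, not the exact command used for this bundle.
# Assumes the bailing_hybrid model class is visible to mlx_lm.
from mlx_lm import convert

convert(
    "inclusionAI/Ling-2.6-flash",     # 200 GB bf16 source
    mlx_path="Ling-2.6-flash-MXFP4",  # output directory
    quantize=True,
    q_bits=4,                         # 4-bit affine weights
    q_group_size=32,                  # one scale/bias per 32 weights
)
```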

## Why two variants?

|  | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack. Concretely, the loader split looks like the sketch below.
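
The `jang_tools` call signature here is an assumption by analogy with `mlx_lm.load`; consult the JANGTQ card for the real invocation:

```python
# MXFP4: stock MLX path, no extra runtime.
from mlx_lm import load
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")

# JANGTQ2: TurboQuant kernel path (assumed signature, see the JANGTQ card).
# from jang_tools import load_jangtq
# model, tokenizer = load_jangtq("JANGQ-AI/Ling-2.6-flash-JANGTQ")
```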

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full-softmax MLA; the other 28 of 32 are Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
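
The schedule in the table reduces to a one-line rule; a minimal illustrative check (the helper name is ours, not from the bundle):

```python
# Illustrative only: the MLA / linear-attention split from the table above.
def layer_kind(i: int) -> str:
    # Every 8th layer (indices 7, 15, 23, 31) is full-softmax MLA.
    return "mla" if i % 8 == 7 else "linear"

schedule = [layer_kind(i) for i in range(32)]
assert [i for i, k in enumerate(schedule) if k == "mla"] == [7, 15, 23, 31]
assert schedule.count("linear") == 28  # 1 dense-MLP layer + 27 MoE layers
```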

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ) for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```
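
Generation then follows the standard `mlx_lm` API; a short usage sketch (the prompt text is illustrative):

```python
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}],
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```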

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is present (shipped with `jang-tools >= TBD`). The bundle's `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` provide HF compatibility for tooling that goes through transformers.
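
A quick way to confirm the model class is visible before loading (a plain standard-library check, nothing bundle-specific):

```python
import importlib.util

# mlx_lm resolves architectures from the "model_type" in config.json;
# this just confirms the bailing_hybrid module is importable.
assert importlib.util.find_spec("mlx_lm.models.bailing_hybrid") is not None
```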

## Reasoning + tools

Default is **`detailed thinking off`**. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before the answer when thinking is on. Tool calls follow the DeepSeek-style format.
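
Putting it together, a hedged end-to-end sketch: enable thinking, generate, and strip the reasoning block (the regex assumes the `<think>...</think>` format described above):

```python
import re

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "What is 17 * 24?"},
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
raw = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Keep only the final answer after the <think>...</think> block.
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(answer)
```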

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first inference for open-weight LLMs.