---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-MXFP4

~103B-A8B hybrid MoE · 63 GB on disk (down from the 200 GB bf16 source) · stock 4-bit affine quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model class; no TurboQuant runtime, no sidecar required.
- Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- Quantization: MXFP4 · every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at 4-bit affine, group_size=32. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough.
- Bundle size: 63 GB on disk across 51 shards
- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
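The 4-bit affine group quantization described above can be sketched in plain Python. This is a minimal illustration of the scheme (per-group min/max scale and zero-point over 32-weight groups), not the MLX kernel; the function names are hypothetical:

```python
def quantize_affine_4bit(weights, group_size=32):
    """Sketch of 4-bit affine quantization with a per-group scale and zero-point.
    Assumes len(weights) is a multiple of group_size."""
    groups = []
    for g in range(0, len(weights), group_size):
        block = weights[g:g + group_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / 15 or 1.0          # 4 bits -> 16 levels (0..15)
        q = [min(15, max(0, round((w - lo) / scale))) for w in block]
        groups.append((q, scale, lo))
    return groups

def dequantize_affine_4bit(groups):
    """Reconstruct approximate weights from (codes, scale, zero-point) groups."""
    out = []
    for q, scale, lo in groups:
        out.extend(v * scale + lo for v in q)
    return out
```

Rounding to the nearest of 16 levels bounds the per-weight error at half a quantization step (`scale / 2`) within each group, which is why the group size matters: smaller groups track local weight ranges more tightly at the cost of more scale/zero-point overhead.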
## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |
JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack.
## Architecture (bailing_hybrid)

Hybrid attention: every 8th layer is full-softmax MLA, the other 28 of 32 are Lightning-Linear-Attention, plus a Multi-Token Prediction (MTP) head.
| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1β6, 8β14, 16β22, 24β30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
See the JANGTQ variant card for the deeper architecture writeup.
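The layer pattern in the table can be reproduced with a small sketch (the `layer_schedule` helper is hypothetical, not taken from the actual model code):

```python
def layer_schedule(num_layers=32):
    """Sketch of the bailing_hybrid layer pattern: every 8th layer is
    full-softmax MLA, layer 0 uses a dense MLP, everything else is MoE."""
    sched = []
    for i in range(num_layers):
        attn = "MLA" if (i + 1) % 8 == 0 else "lightning-linear"
        mlp = "dense" if i == 0 else "moe-256+1"
        sched.append((i, attn, mlp))
    return sched

sched = layer_schedule()
mla_layers = [i for i, attn, _ in sched if attn == "MLA"]   # [7, 15, 23, 31]
```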
## Loading (Python)

```shell
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is present (shipped with jang-tools >= TBD). The bundle's `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` provide HF compatibility for tooling that goes through `transformers`.
## Reasoning + tools

Detailed thinking is off by default. To enable it, prepend a system message:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

With thinking on, the model emits `<think>...</think>` reasoning blocks before its answers. Tool calls use the DeepSeek-style format.
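Downstream code usually strips the reasoning block before displaying the answer. A minimal sketch with a hypothetical `split_thinking` helper (plain regex, not part of mlx_lm):

```python
import re

def split_thinking(text):
    """Separate <think>...</think> reasoning blocks from the final answer."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer
```

Usage: `split_thinking("<think>2+2=4</think>The answer is 4.")` returns the reasoning text and the clean answer separately.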
## Credits

- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Base model: inclusionAI (Ant Group's Bailing team)
- Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- Osaurus: osaurus.ai, Apple-Silicon-first inference for open-weight LLMs