---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE — 63 GB on disk** (down from the 200 GB bf16 source) — **stock 4-bit affine** quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model class — no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA and Lightning linear attention, 256 experts top-8, MTP head, 131K context)
- **Quantization:** MXFP4 — every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at **4-bit affine, group_size=32**. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough. A hedged reproduction sketch appears at the end of this card.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio

## Why two variants?

|  | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack.

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full-softmax MLA; the other 28 of 32 are Lightning linear attention (a minimal sketch of the linear-attention state update appears at the end of this card). A Multi-Token Prediction (MTP) head sits on top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ) for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is present (shipped with `jang-tools >= TBD`). The bundle's `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` provide HF compatibility for tooling that goes through transformers. A full generation round-trip is sketched at the end of this card.

## Reasoning + tools

Default is **`detailed thinking off`**. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `...` reasoning blocks before answers when thinking is on. Tool calls use the DeepSeek-style format.

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI) — Ant Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai) — Apple-Silicon-first inference for open-weight LLMs
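## Worked examples

The snippets below expand on the sections above. They are hedged sketches, not canonical scripts from this repo: anything not shown in the card itself (prompt text, sampling settings, helper names, shapes) is an assumption.

**Full generation round-trip.** A minimal end-to-end call through stock `mlx_lm`, following the Loading section; the prompt content and `max_tokens` value are illustrative.

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")

messages = [
    # flip to "detailed thinking on" to get reasoning blocks (see above)
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "Explain top-8 expert routing in two sentences."},
]
# Format the conversation with the bundle's chat template.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```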
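**Reproducing the quantization recipe.** A hypothetical sketch of how the 4-bit affine / group_size=32 conversion with fp16/fp32 passthrough could be expressed with `mlx_lm.convert`. The `quant_predicate` hook exists in recent mlx-lm releases, but the skip list below is a guess at the passthrough set named in the Quantization bullet, not the exact script used for this upload.

```python
from mlx_lm import convert

# Substrings guessed from the passthrough set named above (norms, router
# gates, expert biases, slopes). The checkpoint's real parameter paths
# may differ.
SKIP = ("norm", "gate", "router", "bias", "slope")

def quant_predicate(path, module, config):
    # Return False to leave this module in fp16/fp32 passthrough.
    return not any(s in path for s in SKIP)

convert(
    "inclusionAI/Ling-2.6-flash",
    mlx_path="Ling-2.6-flash-MXFP4",
    quantize=True,
    q_bits=4,            # 4-bit affine
    q_group_size=32,     # group_size=32, as in the card
    quant_predicate=quant_predicate,
)
```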
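**Linear-attention state update.** Why 28 of 32 layers can serve 131K context without a growing KV cache: a decayed linear-attention layer (the family Lightning-Attention-2 belongs to) carries a fixed-size running state instead of per-token keys and values. Shapes, names, and the scalar per-head decay `slope` are assumptions for illustration, not the model's actual kernel.

```python
import mlx.core as mx

def linear_attention_step(state, q, k, v, slope):
    """One decode step for a single head.

    state: (d_k, d_v) running summary of all past tokens
    q, k:  (d_k,) query/key for the current token
    v:     (d_v,) value for the current token
    slope: scalar per-head decay in (0, 1)
    """
    # Decay the old summary and fold in the new token:
    # cost is O(d_k * d_v), independent of context length.
    state = slope * state + mx.outer(k, v)
    # Read the summary with the query instead of attending over history.
    out = q @ state
    return state, out
```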
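**Top-8 routing over 256 experts.** What an "MoE (256+1)" block in the architecture table does per token, in reference form: pick the 8 highest router gates, renormalize them, and add the always-on shared expert. All names here are invented for the sketch; real kernels batch this instead of looping.

```python
import mlx.core as mx

def moe_token(x, router_w, experts, shared_expert, top_k=8):
    """Route one token vector x through a 256+1 expert block."""
    logits = x @ router_w                    # (256,) router scores
    weights = mx.softmax(logits)
    idx = mx.argsort(-weights)[:top_k]       # indices of the 8 largest gates
    w = weights[idx]
    w = w / w.sum()                          # renormalize over the chosen experts
    out = shared_expert(x)                   # the "+1" shared expert always fires
    for gate, i in zip(w.tolist(), idx.tolist()):
        out = out + gate * experts[i](x)     # mix the 8 routed experts
    return out
```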