---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source), with
**stock 4-bit affine** quantization on inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class: no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention,
  256 experts top-8, MTP head, 131K context)
- **Quantization:** MXFP4. Every weight (routed experts, attention,
  shared experts, dense MLP, embed, lm_head) is quantized at **4-bit affine,
  group_size=32**. Norms, router gates, expert biases, and slopes stay
  fp16/fp32 passthrough.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio

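For orientation, this recipe maps naturally onto `mlx_lm`'s converter. The sketch
below is illustrative only, assuming a recent `mlx-lm` whose `convert()` exposes the
`quant_predicate` hook; the `".gate"` path match used to spare the MoE routers is a
guess at the module naming, not taken from this bundle:

```python
from mlx_lm import convert

def keep_routers_high_precision(path, module, config):
    """Quantize everything except the MoE router gates.

    `path` is the dotted module path; returning False skips quantization for
    that module, True applies the global q_bits / q_group_size setting.
    The ".gate" (router) vs "gate_proj" (MLP projection) distinction is an
    assumption about module names, not verified against this checkpoint.
    """
    if path.endswith(".gate"):
        return False
    return True

convert(
    hf_path="inclusionAI/Ling-2.6-flash",
    mlx_path="Ling-2.6-flash-MXFP4",
    quantize=True,
    q_bits=4,          # 4-bit affine weights
    q_group_size=32,   # group size 32, as in this bundle
    quant_predicate=keep_routers_high_precision,
)
```
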
## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts the cheap-but-slow MXTQ codec on the routed experts in
exchange for a tighter overall bit budget. MXFP4 is the simpler
"just-works-with-stock-MLX" option for users who don't want the TurboQuant
runtime in their stack.

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full softmax MLA, the other 28 of 32
are Lightning-Linear-Attention. Plus a Multi-Token Prediction head.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
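
The alternation is easy to express programmatically. A minimal sketch of how a
loader might classify each layer, assuming a simple modulo rule; the constant
names here are illustrative, not the actual `bailing_hybrid` config keys:

```python
# Hypothetical sketch of the layer pattern above; names are assumptions,
# not the real bailing_hybrid config schema.
NUM_LAYERS = 32
FULL_ATTN_INTERVAL = 8  # every 8th layer (7, 15, 23, 31) is full softmax MLA

def layer_kind(i: int) -> str:
    attn = "mla" if (i + 1) % FULL_ATTN_INTERVAL == 0 else "linear"
    mlp = "dense" if i == 0 else "moe"
    return f"{attn}+{mlp}"

print([layer_kind(i) for i in range(NUM_LAYERS)])
# layer 0 -> linear+dense, layers 7/15/23/31 -> mla+moe, the rest -> linear+moe
```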

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.
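
Generation then follows the standard `mlx_lm` pattern. A short usage sketch
reusing `model` and `tokenizer` from above; the prompt text and `max_tokens`
value are arbitrary choices, not recommendations from this card:

```python
# Build a chat-formatted prompt with the bundled tokenizer's template.
messages = [{"role": "user", "content": "Summarize MLA in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Generate a completion from the quantized model.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```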

## Reasoning + tools

Default is **`detailed thinking off`**. To enable, set the system prompt to
`detailed thinking on`:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before answers when
thinking is on. Tool calls use the DeepSeek-style format.
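
For a concrete run with thinking enabled, a minimal sketch reusing the loaded
`model`/`tokenizer`; splitting on the `<think>` tags with a regex is a display
convenience added here, not part of the bundle:

```python
import re
from mlx_lm import generate

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Why is the sky blue?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
raw = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Separate the <think>...</think> trace from the final answer for display.
match = re.search(r"<think>(.*?)</think>\s*(.*)", raw, flags=re.DOTALL)
thinking, answer = match.groups() if match else ("", raw)
print(answer)
```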

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first
  inference for open-weight LLMs.