---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- jangtq
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-JANGTQ

**~103B-A8B hybrid MoE · 30 GB on disk** (down from the 200 GB bf16 source)
via mixed-precision **JANGTQ2** quantization on inclusionAI's Bailing-V2.5
hybrid architecture. Runs on a 64 GB+ Apple Silicon Mac.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention,
  256 experts top-8, MTP head, 131K context)
- **Quantization:** JANGTQ2: 2-bit MXTQ codebook (Hadamard-rotated,
  Lloyd-Max optimized) on routed-expert weights + 8-bit affine on every
  attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn
  `query_key_value`/`g_proj`/`dense`) + 8-bit affine on shared experts /
  layer-0 dense MLP / MTP `eh_proj` / embed / lm_head + fp16 passthrough on
  every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / `slope` (see
  the rotate-and-quantize sketch after this list)
- **Bundle size:** **30 GB on disk** across 31 shards (24,576 routed-expert
  tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- **Runs on:** M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB
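
For intuition on the 2-bit path, here is a minimal numpy sketch of the
rotate-then-quantize idea: a Hadamard rotation spreads outlier weights across
the matrix, then a Lloyd-Max codebook (1-D k-means; 4 levels at 2 bits) is fit
to the rotated values. The real MXTQ codebook construction, group layout, and
sign-flip vectors are more involved; every shape and helper name below is
illustrative only.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, bits: int = 2, iters: int = 25) -> np.ndarray:
    # 1-D Lloyd's algorithm: 2**bits centroids minimizing squared error
    centers = np.quantile(x, np.linspace(0.05, 0.95, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(len(centers)):
            if np.any(idx == j):
                centers[j] = x[idx == j].mean()
    return centers

# Toy "routed-expert" weight; shapes are illustrative, not the model's
W = np.random.randn(1024, 512).astype(np.float32)
H = hadamard(W.shape[0])
Wr = H @ W                                       # rotation flattens outliers
codebook = lloyd_max(Wr.ravel())
idx = np.abs(Wr.ravel()[:, None] - codebook[None, :]).argmin(axis=1)
W_hat = H.T @ codebook[idx].reshape(Wr.shape)    # dequantize + de-rotate
```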

## Architecture (`bailing_hybrid`)

This is a **hybrid attention** model: every 8th layer is full softmax MLA,
while the other 28 layers (of 32) are Lightning-Linear-Attention (Gated
Linear Attention with per-head learned slopes). A single Multi-Token
Prediction head sits on top for speculative decoding; the layer dispatch is
sketched after the table below.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (layer 32) | 1 | MLA | MoE (256+1) |
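
A hypothetical helper showing how a runtime might implement that dispatch
(not the actual `bailing_hybrid.py` code):

```python
def attention_kind(layer_idx: int, num_layers: int = 32) -> str:
    # layers 7, 15, 23, 31 run full softmax MLA; the MTP head (index 32
    # here) also uses MLA per the table; everything else is linear (GLA)
    if layer_idx >= num_layers:
        return "mla"
    return "mla" if layer_idx % 8 == 7 else "linear"
```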

**MLA:** DeepSeek-V2-flavored: `q_lora_rank=1536`, `kv_lora_rank=512`,
`qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M),
`v_head_dim=128`. Materializes K and V (no absorb optimization).
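
A shape-only numpy sketch of that projection chain. The low-rank and head
dims are the ones above; `d_model` and `n_heads` are hypothetical
placeholders, and the RMSNorms between the low-rank stages are omitted:

```python
import numpy as np

d_model, n_heads = 4096, 32                # hypothetical, not from the config
q_lora, kv_lora = 1536, 512
d_nope, d_rope, d_v = 128, 64, 128

h = np.random.randn(d_model).astype(np.float32)
W_qa = np.random.randn(q_lora, d_model)    # q down-projection
W_qb = np.random.randn(n_heads * (d_nope + d_rope), q_lora)
W_kva = np.random.randn(kv_lora + d_rope, d_model)
W_kvb = np.random.randn(n_heads * (d_nope + d_v), kv_lora)

q = (W_qb @ (W_qa @ h)).reshape(n_heads, d_nope + d_rope)
c = W_kva @ h
c_kv, k_rope = c[:kv_lora], c[kv_lora:]    # RoPE key is shared across heads
kv = (W_kvb @ c_kv).reshape(n_heads, d_nope + d_v)
k = np.hstack([kv[:, :d_nope], np.tile(k_rope, (n_heads, 1))])
v = kv[:, d_nope:]                         # K, V materialized per head
```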

**Linear attention:** Lightning-Attention-2 / Gated Linear Attention: fused
`query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of
head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by
layer_idx, sigmoid-gated output through `g_proj` + `GroupRMSNorm` (groups=4).
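
At decode time the linear path reduces to a per-head state recurrence. A
generic gated-linear-attention sketch follows; the fused Lightning-Attention-2
kernel and the exact slope-to-decay wiring differ, so treat the decay term as
an assumption:

```python
import numpy as np

def gla_decode_step(S, q, k, v, slope):
    # S: (d_k, d_v) running state for one head; the ALiBi-style slope sets
    # an exponential decay on the accumulated past
    decay = np.exp(-slope)
    S = decay * S + np.outer(k, v)   # fold the new token into the state
    o = q @ S                        # this head's raw output; downstream it
    return S, o                      # passes GroupRMSNorm + sigmoid(g_proj) gate
```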

**MoE routing:** sigmoid scoring + grouped routing (256 experts in 8 groups;
pick the top-4 groups, then the top-8 experts within them), with
`e_score_correction_bias` for noaux_tc: selection uses the biased scores,
weighting uses the original sigmoid scores. Identical math to
DeepSeek-V3 / Kimi-K2.6; sketched below.
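
Since the math matches DeepSeek-V3, here is a numpy sketch of one token's
routing pass. Ranking groups by the sum of their top-2 biased scores is
DeepSeek-V3's convention and is assumed to carry over:

```python
import numpy as np

def route_token(h, W_gate, expert_bias, n_groups=8, topk_groups=4, top_k=8):
    scores = 1.0 / (1.0 + np.exp(-(W_gate @ h)))      # sigmoid over 256 experts
    biased = scores + expert_bias                      # noaux_tc correction
    per_group = biased.reshape(n_groups, -1)           # (8, 32)
    group_rank = np.sort(per_group, axis=1)[:, -2:].sum(axis=1)
    keep = np.argsort(group_rank)[-topk_groups:]       # top-4 groups survive
    masked = np.full_like(per_group, -np.inf)
    masked[keep] = per_group[keep]
    experts = np.argsort(masked.ravel())[-top_k:]      # top-8 by biased score
    weights = scores[experts] / scores[experts].sum()  # weight by UNbiased score
    return experts, weights
```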

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 matrices × 32 MoE layers = 24,576 tensors) | bf16 | **2-bit MXTQ** + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |
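
The `8-bit affine g=64` entries above mean group-wise affine quantization
with a group size of 64; a minimal sketch of what that format implies (the
actual shard packing is not specified here):

```python
import numpy as np

def affine_quantize(w, bits=8, group=64):
    # one (scale, min) pair per group of 64 weights
    g = w.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / (2 ** bits - 1)
    scale = np.maximum(scale, 1e-8)            # guard all-equal groups
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return (q.astype(np.float32) * scale + lo).ravel()
```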

`jangtq_runtime.safetensors` sidecar (20.8 KB) for Swift runtimes: covers the
`(in_features={1024,4096}, seed=42, bits=2)` codebooks + sign-flip vectors.

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools import load_jangtq
model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```
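
Assuming the returned pair is `mlx_lm`-compatible (which the bundled
`bailing_hybrid.py` model class suggests), generation then goes through the
standard `mlx_lm` entry point:

```python
from mlx_lm import generate

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain MLA in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```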

The bundle ships `configuration_bailing_moe_v2_5.py` and
`modeling_bailing_moe_v2_5.py` for HF compatibility, plus an `mlx_lm`
model class (`bailing_hybrid.py`) that handles the hybrid attention
dispatch, MLA, GLA recurrence, MoE routing, and MTP.

## Reasoning + tools

The chat template defaults to **`detailed thinking off`**. To enable
chain-of-thought:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` blocks before its answer when thinking
is on; runtimes binding to the `deepseek_r1` reasoning parser will extract
them automatically. Tool calls follow the DeepSeek tool-call block format.
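
If your runtime doesn't bind a reasoning parser, splitting the blocks out by
hand is straightforward; a minimal sketch of what `deepseek_r1`-style parsers
do:

```python
import re

def split_thinking(text: str):
    # returns (list of <think> blocks, visible answer with them stripped)
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer
```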

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI) (Ant
  Group's Bailing team)
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3