---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- jangtq
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-JANGTQ

**~103B-A8B hybrid MoE — 30 GB on disk** (down from the 200 GB bf16 source) — mixed-precision **JANGTQ2** quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Runs on a 64 GB+ Apple Silicon Mac.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- **Quantization:** JANGTQ2 —
  - 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights
  - 8-bit affine on every attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn `query_key_value`/`g_proj`/`dense`)
  - 8-bit affine on shared experts / layer-0 dense MLP / MTP `eh_proj` / embed / lm_head
  - fp16 passthrough on every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / `slope`
- **Bundle size:** **30 GB on-disk** across 31 shards (24,576 routed-expert tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- **Runs on:** M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB

## Architecture (`bailing_hybrid`)

This is a **hybrid attention** model — every 8th layer is full softmax MLA; the other 28 layers (of 32) are Lightning-Linear-Attention (Gated Linear Attention with per-head learned slopes). A single Multi-Token Prediction head sits on top for speculative decoding.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |

**MLA:** DeepSeek-V2-flavored — `q_lora_rank=1536`, `kv_lora_rank=512`, `qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M), `v_head_dim=128`. Materializes K and V (no absorb optimization).

**Linear attention:** Lightning-Attention-2 / Gated Linear Attention — fused `query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by layer_idx, sigmoid-gated output through `g_proj` + `GroupRMSNorm` (groups=4). A recurrence sketch follows the routing example below.

**MoE routing:** sigmoid scoring + grouped routing (8 groups × top-4 groups + top-8 within), `e_score_correction_bias` for noaux_tc — selection uses biased scores, weighting uses original sigmoid scores. Identical math to DeepSeek-V3 / Kimi-K2.6.
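To make the selection-vs-weighting split concrete, here is a minimal numpy sketch of that routing step for one token. The group-ranking rule (sum of each group's top-2 biased scores) and the final renormalization follow the DeepSeek-V3 recipe the card cites; the function and variable names are illustrative, and the shipped `bailing_hybrid.py` may differ in detail.

```python
import numpy as np

def route_token(logits, bias, n_groups=8, topk_groups=4, top_k=8):
    """Sketch of noaux_tc sigmoid routing for one token.

    logits -- (256,) raw router logits (hidden @ gate.weight.T)
    bias   -- (256,) e_score_correction_bias, used for SELECTION only
    """
    scores = 1.0 / (1.0 + np.exp(-logits))   # sigmoid scoring
    biased = scores + bias                    # biased selection scores

    # Grouped routing: rank the 8 expert groups by the sum of each
    # group's top-2 biased scores (DeepSeek-V3 recipe), keep the top 4.
    groups = biased.reshape(n_groups, -1)                 # (8, 32)
    group_rank = np.sort(groups, axis=-1)[:, -2:].sum(axis=-1)
    keep = np.argsort(group_rank)[-topk_groups:]
    mask = np.full((n_groups, 1), -np.inf)
    mask[keep] = 0.0
    masked = (groups + mask).reshape(-1)      # drop excluded groups

    # Select top-8 experts by BIASED score, but weight them by the
    # ORIGINAL sigmoid scores, renormalized to sum to 1.
    experts = np.argsort(masked)[-top_k:]
    weights = scores[experts]
    return experts, weights / weights.sum()
```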
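And the linear-attention path described above, stripped to its per-token recurrence: each head keeps a (d_k × d_v) state updated with a per-head decay, which Lightning-Attention-2 computes blockwise for prefill. The scalar `decay` below stands in for the per-head slope term, and the Q/K norms, partial RoPE, and fused projections are assumed to have been applied already — an illustration of the recurrence, not the shipped kernel.

```python
import numpy as np

def gla_decode_step(state, q, k, v, decay, g):
    """One decoding step of a decay-gated linear-attention head.

    state -- (d_k, d_v) running KV summary S_{t-1}
    q, k  -- (d_k,) query/key (post-RMSNorm, partial RoPE applied)
    v     -- (d_v,) value
    decay -- scalar per-head forget factor derived from the slope
    g     -- (d_v,) gate pre-activation from g_proj
    """
    state = decay * state + np.outer(k, v)   # S_t = d * S_{t-1} + k v^T
    out = q @ state                          # o_t = q^T S_t
    gate = 1.0 / (1.0 + np.exp(-g))          # sigmoid output gate
    return state, out * gate                 # GroupRMSNorm would follow
```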
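For reference before the tensor inventory below: "8-bit affine g=64" means one scale/offset pair per 64-weight group. A plain min/max sketch of that format follows — whether JANGTQ's 8-bit path picks its quantization range exactly this way is an assumption.

```python
import numpy as np

def affine_quant_g64(w, group=64):
    """Sketch of group-wise 8-bit affine quantization ("8-bit affine
    g=64"): one scale and one offset per 64-weight group."""
    w = w.reshape(-1, group)
    lo = w.min(axis=-1, keepdims=True)
    hi = w.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / 255.0, 1e-12)  # avoid div-by-zero
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo                           # dequant: q*scale + lo
```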
## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | **2-bit MXTQ** + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |

`jangtq_runtime.safetensors` sidecar (20.8 KB) for Swift runtimes — covers `(in_features={1024,4096}, seed=42, bits=2)` codebooks + sign-flip vectors.

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools import load_jangtq

model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```

The bundle ships `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` for HF compatibility, plus an `mlx_lm` model class (`bailing_hybrid.py`) that handles the hybrid attention dispatch, MLA, GLA recurrence, MoE routing, and MTP.

## Reasoning + tools

The chat template defaults to **`detailed thinking off`**. To enable chain-of-thought, prepend a system message:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` blocks before its answer when thinking is on; runtimes binding to the `deepseek_r1` reasoning parser will extract them automatically. Tool calls follow the DeepSeek tool-call block format.

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI) — Ant Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3