---
license: mit
tags:
  - moe
  - mixture-of-experts
  - hybrid-attention
  - mla
  - lightning-attention
  - jangtq
  - osaurus
  - mlx
  - bailing
  - ling
  - apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-JANGTQ

**~103B-A8B hybrid MoE, 30 GB on disk** (down from the 200 GB bf16 source):
mixed-precision **JANGTQ2** quantization on inclusionAI's Bailing-V2.5 hybrid
architecture. Runs on a 64 GB+ Apple Silicon Mac.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention,
  256 experts top-8, MTP head, 131K context)
- **Quantization:** JANGTQ2 – 2-bit MXTQ codebook (Hadamard-rotated,
  Lloyd-Max optimized) on routed-expert weights + 8-bit affine on every
  attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn
  `query_key_value`/`g_proj`/`dense`) + 8-bit affine on shared experts /
  layer-0 dense MLP / MTP `eh_proj` / embed / lm_head + fp16 passthrough on
  every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / `slope`
- **Bundle size:** **30 GB on disk** across 31 shards (24,576 routed-expert
  tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- **Runs on:** M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB

## Architecture (`bailing_hybrid`)

This is a **hybrid attention** model: every 8th layer uses full softmax MLA,
while the other 28 of the 32 layers use Lightning-Linear-Attention (Gated
Linear Attention with per-head learned slopes). A single Multi-Token
Prediction head is included for speculative decoding. The layer pattern is
tabulated below, with a small dispatch sketch after the table.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
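
As a quick illustration of the dispatch rule implied by the table (an
illustrative sketch only; the shipped `bailing_hybrid.py` may key off a
per-layer config flag rather than index arithmetic):

```python
# Sketch of the hybrid-attention layout above: every 8th layer is full MLA.
NUM_LAYERS = 32

def uses_full_mla(layer_idx: int) -> bool:
    """Layers 7, 15, 23, 31 are full softmax MLA; the rest are GLA."""
    return (layer_idx + 1) % 8 == 0

layout = ["MLA" if uses_full_mla(i) else "GLA" for i in range(NUM_LAYERS)]
print([i for i, kind in enumerate(layout) if kind == "MLA"])  # [7, 15, 23, 31]
```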

**MLA:** DeepSeek-V2-flavored – `q_lora_rank=1536`, `kv_lora_rank=512`,
`qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M),
`v_head_dim=128`. Materializes K and V (no absorb optimization).
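
For orientation, here is a shape walkthrough of the five projections named
above, following the DeepSeek-V2 layout; `hidden_size` and `num_heads` are
placeholder assumptions, since neither is listed on this card:

```python
# MLA projection shapes (illustrative; DeepSeek-V2-style layout).
hidden_size, num_heads = 4096, 32                 # assumed for illustration only
q_lora_rank, kv_lora_rank = 1536, 512
qk_nope, qk_rope, v_dim = 128, 64, 128

q_head_dim = qk_nope + qk_rope                    # 192 per query/key head
shapes = {
    "q_a":   (hidden_size, q_lora_rank),                      # hidden -> low-rank query
    "q_b":   (q_lora_rank, num_heads * q_head_dim),
    "kv_a":  (hidden_size, kv_lora_rank + qk_rope),           # compressed KV + shared RoPE key
    "kv_b":  (kv_lora_rank, num_heads * (qk_nope + v_dim)),   # K and V materialized
    "dense": (num_heads * v_dim, hidden_size),                # output projection
}
for name, (fan_in, fan_out) in shapes.items():
    print(f"{name:5s}: {fan_in} -> {fan_out}")
```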

**Linear attention:** Lightning-Attention-2 / Gated Linear Attention – fused
`query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of
head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by
layer_idx, sigmoid-gated output through `g_proj` + `GroupRMSNorm` (groups=4).
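
For intuition, a stripped-down single-head version of the decayed recurrence
(a sketch only: it ignores the chunked Lightning-Attention-2 formulation, the
partial RoPE, the per-head Q/K norms, and the sigmoid output gate described
above):

```python
import numpy as np

def decayed_linear_attention(q, k, v, slope):
    """q, k, v: (T, d) for one head; slope: that head's decay exponent."""
    T, d = q.shape
    decay = np.exp(-slope)                  # per-step decay from the ALiBi-style slope
    state = np.zeros((d, d))
    out = np.zeros_like(v)
    for t in range(T):
        state = decay * state + np.outer(k[t], v[t])   # accumulate k^T v with decay
        out[t] = q[t] @ state                          # read out with the query
    return out

out = decayed_linear_attention(*np.random.randn(3, 8, 16), slope=0.05)
```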

**MoE routing:** sigmoid scoring + grouped routing (8 groups; the top-4
groups are selected, then the top-8 experts within them), with
`e_score_correction_bias` for noaux_tc: selection uses the biased scores,
weighting uses the original sigmoid scores. Identical math to
DeepSeek-V3 / Kimi-K2.6.
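
A minimal single-token sketch of that selection/weighting split (the group
score is shown as the sum of the top-2 biased scores per group, which is the
DeepSeek-V3 convention and an assumption here, as is the final weight
normalization):

```python
import numpy as np

def route_token(logits, expert_bias, n_groups=8, top_groups=4, top_k=8):
    scores = 1.0 / (1.0 + np.exp(-logits))      # original sigmoid scores (for weighting)
    biased = scores + expert_bias                # bias-corrected scores (for selection)
    group_size = biased.size // n_groups
    per_group = biased.reshape(n_groups, group_size)
    group_scores = np.sort(per_group, axis=-1)[:, -2:].sum(axis=-1)   # top-2 sum per group
    group_mask = np.zeros(n_groups)
    group_mask[np.argsort(group_scores)[-top_groups:]] = 1.0          # keep top-4 groups
    masked = np.where(np.repeat(group_mask, group_size) > 0, biased, -np.inf)
    experts = np.argsort(masked)[-top_k:]        # top-8 experts by biased score
    weights = scores[experts]                    # weight by the original sigmoid scores
    return experts, weights / weights.sum()

experts, weights = route_token(np.random.randn(256), np.zeros(256))
```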

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | **2-bit MXTQ** + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |
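
Here "8-bit affine g=64" means per-group scale/zero-point quantization with
group size 64 along the input dimension; a generic dequantization sketch (not
the exact JANGTQ2 / MLX packing):

```python
import numpy as np

def dequant_affine(codes, scale, zero, group_size=64):
    """codes: (out, in) uint8; scale/zero: (out, in // group_size) per-group params."""
    out_f, in_f = codes.shape
    grouped = codes.reshape(out_f, in_f // group_size, group_size).astype(np.float32)
    return ((grouped - zero[..., None]) * scale[..., None]).reshape(out_f, in_f)

w = dequant_affine(
    np.random.randint(0, 256, size=(8, 128), dtype=np.uint8),
    scale=np.full((8, 2), 0.01, dtype=np.float32),
    zero=np.full((8, 2), 128.0, dtype=np.float32),
)
```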

A `jangtq_runtime.safetensors` sidecar (20.8 KB) is included for Swift
runtimes; it covers the `(in_features={1024,4096}, seed=42, bits=2)`
codebooks + sign-flip vectors.

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools import load_jangtq
model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```
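
Once loaded, generation should work through the standard `mlx_lm` generate API
(this assumes `load_jangtq` returns an mlx_lm-compatible model/tokenizer pair,
which the bundled `bailing_hybrid.py` model class suggests):

```python
from mlx_lm import generate

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain MoE routing in one paragraph."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```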

The bundle ships `configuration_bailing_moe_v2_5.py` and
`modeling_bailing_moe_v2_5.py` for HF compatibility, plus an `mlx_lm`
model class (`bailing_hybrid.py`) that handles the hybrid attention
dispatch, MLA, GLA recurrence, MoE routing, and MTP.

## Reasoning + tools

The chat template defaults to **`detailed thinking off`**. To enable
chain-of-thought:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]
```

The model emits `<think>...</think>` blocks before its answer when thinking
is on; runtimes binding to the `deepseek_r1` reasoning parser will extract
them automatically. Tool calls follow the DeepSeek tool-call block format.
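
If your runtime doesn't bind a reasoning parser, the blocks are easy to split
out manually; a minimal sketch:

```python
import re

def split_thinking(text: str):
    """Separate <think>...</think> content from the final answer."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

thoughts, answer = split_thinking("<think>step 1...</think>The answer is 42.")
```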

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI) – Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3