---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- jangtq
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---
<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>
# Ling-2.6-flash-JANGTQ
**~103B-A8B hybrid MoE, 30 GB on disk** (down from the 200 GB bf16 source):
mixed-precision **JANGTQ2** quantization of inclusionAI's Bailing-V2.5 hybrid
architecture. Runs on a 64 GB+ Apple Silicon Mac.
- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA and
  Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- **Quantization:** JANGTQ2: 2-bit MXTQ codebook (Hadamard-rotated,
  Lloyd-Max optimized) on routed-expert weights + 8-bit affine on every
  attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn
  `query_key_value`/`g_proj`/`dense`) + 8-bit affine on shared experts /
  layer-0 dense MLP / MTP `eh_proj` / embed / lm_head + fp16 passthrough on
  every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / `slope`
- **Bundle size:** **30 GB on disk** across 31 shards (24,576 routed-expert
  tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- **Runs on:** M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB
## Architecture (`bailing_hybrid`)
This is a **hybrid attention** model: every 8th layer is full-softmax MLA,
while the other 28 layers (of 32) are Lightning-Linear-Attention (Gated Linear
Attention with per-head learned slopes). A single Multi-Token Prediction head
is attached for speculative decoding.
| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
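The layer schedule in the table above can be reproduced with a one-liner
(layer indices are 0-based, so layers 7, 15, 23, and 31 are the full-softmax
MLA layers):

```python
# Recover the hybrid attention schedule for the 32 decoder layers.
# Every 8th layer (0-based index i with (i + 1) % 8 == 0) is full-softmax MLA;
# all other layers use Lightning / Gated Linear Attention (GLA).
layer_types = ["MLA" if (i + 1) % 8 == 0 else "GLA" for i in range(32)]

mla_layers = [i for i, t in enumerate(layer_types) if t == "MLA"]
print(mla_layers)                 # [7, 15, 23, 31]
print(layer_types.count("GLA"))   # 28
```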
**MLA:** DeepSeek-V2-flavored: `q_lora_rank=1536`, `kv_lora_rank=512`,
`qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M),
`v_head_dim=128`. Materializes K and V (no absorb optimization).
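As a shape check, the MLA projections decompose as below. The card does not
state the hidden size or head count, so `hidden_size` and `n_heads` are
illustrative placeholders; only the low-rank and head dims come from the
config above.

```python
# Illustrative MLA projection shapes. hidden_size and n_heads are
# hypothetical; the low-rank / head dims are from the model config.
hidden_size, n_heads = 4096, 32          # placeholders for illustration
q_lora_rank, kv_lora_rank = 1536, 512
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128

qk_head_dim = qk_nope_head_dim + qk_rope_head_dim   # 192 per query/key head
shapes = {
    "q_a":  (hidden_size, q_lora_rank),                       # Q down-proj
    "q_b":  (q_lora_rank, n_heads * qk_head_dim),             # Q up-proj
    "kv_a": (hidden_size, kv_lora_rank + qk_rope_head_dim),   # latent KV + rope-K
    "kv_b": (kv_lora_rank, n_heads * (qk_nope_head_dim + v_head_dim)),
}
print(qk_head_dim)   # 192
```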
**Linear attention:** Lightning-Attention-2 / Gated Linear Attention: fused
`query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of
head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by
layer_idx, sigmoid-gated output through `g_proj` + `GroupRMSNorm` (groups=4).
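A minimal numpy sketch of the linear-attention recurrence described above.
The shapes, the toy `slope` value, and the exact decay parameterization are
assumptions for illustration; the real kernel is chunked and fused, and the
output then passes through the sigmoid gate and `GroupRMSNorm`.

```python
import numpy as np

# Toy single-head GLA recurrence: the state S accumulates outer(k, v) with a
# per-head ALiBi-style decay; the output reads S with the query.
T, d_k, d_v = 6, 8, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for d in (d_k, d_k, d_v))

slope = 0.9              # per-head slope (assumed value for illustration)
decay = np.exp(-slope)   # illustrative per-step decay factor

S = np.zeros((d_k, d_v))
outs = []
for t in range(T):
    S = decay * S + np.outer(k[t], v[t])   # linear-attention state update
    outs.append(q[t] @ S)                  # read out with the query
out = np.stack(outs)                       # (T, d_v); gate + norm come after
```

This recurrence is equivalent to decay-weighted causal attention:
`out[t] = sum_{s<=t} decay**(t-s) * (q[t]·k[s]) * v[s]`, which is why it runs
in O(T) time with O(d_k * d_v) state per head.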
**MoE routing:** sigmoid scoring + grouped routing (8 groups, top-4 groups
selected, then top-8 experts within them), `e_score_correction_bias` for
noaux_tc: selection uses the biased scores, weighting uses the original
sigmoid scores. Identical math to DeepSeek-V3 / Kimi-K2.6.
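The routing described above can be sketched in numpy. The expert count and
group layout come from the card; the logits are random for illustration, and
the group score here uses the DeepSeek-V3 top-2-sum convention, which the card
implies (via "identical math") but does not spell out.

```python
import numpy as np

n_experts, n_groups, topk_groups, top_k = 256, 8, 4, 8
group_size = n_experts // n_groups       # 32 experts per group
rng = np.random.default_rng(0)
logits = rng.standard_normal(n_experts)
bias = rng.standard_normal(n_experts) * 0.01  # e_score_correction_bias

scores = 1.0 / (1.0 + np.exp(-logits))   # sigmoid scoring
biased = scores + bias                   # noaux_tc: bias affects selection only

# Score each group by the sum of its top-2 biased scores, keep the best
# 4 groups, then take the top-8 experts within the surviving groups.
g = biased.reshape(n_groups, group_size)
group_scores = np.sort(g, axis=1)[:, -2:].sum(axis=1)
keep = np.argsort(group_scores)[-topk_groups:]
mask = np.full(n_experts, -np.inf)
for gi in keep:
    mask[gi * group_size:(gi + 1) * group_size] = 0.0
experts = np.argsort(biased + mask)[-top_k:]

weights = scores[experts]                # weighting uses the ORIGINAL scores
weights /= weights.sum()                 # normalized routing weights
```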
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | **2-bit MXTQ** + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |
`jangtq_runtime.safetensors` sidecar (20.8 KB) for Swift runtimes: covers
`(in_features={1024,4096}, seed=42, bits=2)` codebooks + sign-flip vectors.
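For orientation, a 2-bit codebook dequantization reduces to a lookup plus a
sign flip and a scale. The sketch below is purely illustrative: the real MXTQ
codebook values, the Hadamard rotation, and the sidecar layout are not
documented on this card, so every value here is a placeholder.

```python
import numpy as np

# Hypothetical 2-bit dequant: bits=2 gives a 4-entry codebook.
# Real JANGTQ2 codebooks are Lloyd-Max optimized over Hadamard-rotated
# weights; the values and layout below are placeholders only.
codebook = np.array([-1.2, -0.3, 0.3, 1.2], dtype=np.float32)  # assumed levels
indices = np.array([0, 3, 1, 2, 2, 0], dtype=np.uint8)         # 2-bit codes
signs = np.array([1, -1, 1, 1, -1, 1], dtype=np.float32)       # sign-flip vector
scale = 0.05                                                    # per-group scale

w = scale * signs * codebook[indices]   # rotated-space weights; the runtime
                                        # then applies the inverse Hadamard
```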
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools import load_jangtq
model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```
The bundle ships `configuration_bailing_moe_v2_5.py` and
`modeling_bailing_moe_v2_5.py` for HF compatibility, plus an `mlx_lm`
model class (`bailing_hybrid.py`) that handles the hybrid attention
dispatch, MLA, GLA recurrence, MoE routing, and MTP.
## Reasoning + tools
The chat template defaults to **`detailed thinking off`**. To enable
chain-of-thought:
```python
messages = [
{"role": "system", "content": "detailed thinking on"},
{"role": "user", "content": "..."},
]
```
The model emits `<think>...</think>` blocks before its answer when thinking
is on; runtimes binding to the `deepseek_r1` reasoning parser will extract
them automatically. Tool calls follow the DeepSeek tool-call block format.
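If your runtime does not bind a reasoning parser, the `<think>` block can be
split off manually. A minimal sketch; when thinking is off the model may emit
no block at all, which the fallback branch handles.

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a '<think>...</think>answer' string."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m is None:
        return "", text          # no thinking block emitted
    return m.group(1).strip(), m.group(2).strip()

reasoning, answer = split_thinking("<think>2+2=4</think>The answer is 4.")
print(answer)   # The answer is 4.
```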
## Credits
- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3