---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- jangtq
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-JANGTQ
~103B-A8B hybrid MoE · 30 GB on disk (down from the 200 GB bf16 source) · mixed-precision JANGTQ2 quantization of inclusionAI's Bailing-V2.5 hybrid architecture. Runs on a 64 GB+ Apple Silicon Mac.
- Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- Quantization: JANGTQ2 · 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on every attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn `query_key_value`/`g_proj`/`dense`) + 8-bit affine on shared experts / layer-0 dense MLP / MTP `eh_proj` / embed / lm_head + fp16 passthrough on every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / slope (a dequant sketch follows this list)
- Bundle size: 30 GB on disk across 31 shards (24,576 routed-expert tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- Runs on: M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB
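
The MXTQ container layout isn't documented in this card, but the 8-bit affine side is standard group-wise quantization. A minimal NumPy sketch of one common convention (per-group scale and zero-point over 64-wide groups; the tensor layout here is assumed for illustration, not read from the bundle):

```python
import numpy as np

def dequant_affine8(q, scales, zeros, group_size=64):
    """Group-wise 8-bit affine dequant (layout assumed for illustration).

    q: (out, in) uint8 codes; scales, zeros: (out, in // group_size).
    Each column group of 64 shares one (scale, zero) pair.
    """
    out_f, in_f = q.shape
    codes = q.reshape(out_f, in_f // group_size, group_size).astype(np.float32)
    w = (codes - zeros[..., None]) * scales[..., None]
    return w.reshape(out_f, in_f)
```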
## Architecture (`bailing_hybrid`)
This is a hybrid attention model: every 8th layer is full-softmax MLA, while the other 28 layers (of 32) are Lightning-Linear-Attention (Gated Linear Attention with per-head learned slopes). A single Multi-Token Prediction head is appended for speculative decoding.
| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
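
A minimal sketch of the dispatch rule the table implies (every 8th decoder layer, counting from 1, is full-softmax MLA):

```python
def attention_kind(layer_idx: int) -> str:
    """Hybrid dispatch: layers 7, 15, 23, 31 (0-indexed) use MLA."""
    return "mla" if (layer_idx + 1) % 8 == 0 else "linear"

# Matches the table above.
assert [i for i in range(32) if attention_kind(i) == "mla"] == [7, 15, 23, 31]
```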
MLA: DeepSeek-V2-flavored · `q_lora_rank=1536`, `kv_lora_rank=512`, `qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M), `v_head_dim=128`. Materializes K and V (no absorb optimization).
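
For orientation, a shape-only trace of the low-rank Q/KV path for one token. `d_model` and `n_heads` are placeholders (not stated in this card); the ranks and head dims are the ones above:

```python
import numpy as np

d_model, n_heads = 4096, 32          # placeholders for illustration only
q_rank, kv_rank = 1536, 512
qk_nope, qk_rope, v_head = 128, 64, 128

x = np.random.randn(d_model)
# Q path: compress to q_lora_rank, then expand per head (nope + rope halves).
q_a = np.zeros((q_rank, d_model)) @ x
q   = np.zeros((n_heads * (qk_nope + qk_rope), q_rank)) @ q_a
# KV path: one compressed latent plus a single shared RoPE key slice.
kv_a = np.zeros((kv_rank + qk_rope, d_model)) @ x
kv   = np.zeros((n_heads * (qk_nope + v_head), kv_rank)) @ kv_a[:kv_rank]
# K and V are then materialized per head from kv (no absorb optimization),
# so attention runs as ordinary softmax over full K/V.
assert q.shape == (n_heads * (qk_nope + qk_rope),)
assert kv.shape == (n_heads * (qk_nope + v_head),)
```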
Linear attention: Lightning-Attention-2 / Gated Linear Attention · fused `query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by layer_idx, sigmoid-gated output through `g_proj` + GroupRMSNorm (groups=4).
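
In recurrent form the per-head computation reduces to a decayed state update. A minimal sketch of one decode step (the fused chunked kernel, gating, and norm plumbing are omitted):

```python
import numpy as np

def gla_step(S, q_t, k_t, v_t, slope):
    """One decode step of slope-decayed linear attention for a single head.

    S: (d_k, d_v) running state; q_t, k_t: (d_k,); v_t: (d_v,);
    slope: this head's ALiBi-style decay (scaled by layer_idx in the model).
    The output o_t then passes through the sigmoid(g_proj) gate + GroupRMSNorm.
    """
    S = np.exp(-slope) * S + np.outer(k_t, v_t)   # decay old state, add new k,v
    o_t = q_t @ S                                  # readout for this query
    return S, o_t
```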
MoE routing: sigmoid scoring + grouped routing (8 groups × top-4 groups → top-8 within), `e_score_correction_bias` for noaux_tc: selection uses biased scores, weighting uses the original sigmoid scores. Identical math to DeepSeek-V3 / Kimi-K2.6.
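
Since the card states the math matches DeepSeek-V3, here is a minimal NumPy sketch of that routing for one token (the top-2-per-group group score and the final weight normalization are the DeepSeek-V3 conventions, assumed to carry over):

```python
import numpy as np

def route(logits, e_bias, n_groups=8, topk_groups=4, top_k=8):
    """noaux_tc routing: select on biased scores, weight on raw sigmoid scores."""
    scores = 1.0 / (1.0 + np.exp(-logits))         # sigmoid scoring, (256,)
    biased = scores + e_bias                        # e_score_correction_bias
    group_size = logits.shape[0] // n_groups        # 256 / 8 = 32 experts/group
    # Score each group by its two best biased experts (DeepSeek-V3 convention).
    group_score = np.sort(biased.reshape(n_groups, group_size))[:, -2:].sum(-1)
    keep = np.argsort(group_score)[-topk_groups:]   # top-4 groups
    masked = np.full_like(biased, -np.inf)
    for g in keep:
        lo, hi = g * group_size, (g + 1) * group_size
        masked[lo:hi] = biased[lo:hi]
    idx = np.argsort(masked)[-top_k:]               # top-8 within kept groups
    w = scores[idx] / scores[idx].sum()             # weights from original scores
    return idx, w
```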
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | 2-bit MXTQ + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |
`jangtq_runtime.safetensors` sidecar (20.8 KB) for Swift runtimes · covers the (in_features={1024,4096}, seed=42, bits=2) codebooks + sign-flip vectors.
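
The sidecar is a plain safetensors file, so its contents can be inspected directly. A quick sketch (tensor names are whatever the file actually ships; none are assumed here):

```python
from safetensors import safe_open

with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)
```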
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools import load_jangtq

model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```
The bundle ships `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` for HF compatibility, plus an mlx_lm model class (`bailing_hybrid.py`) that handles the hybrid attention dispatch, MLA, GLA recurrence, MoE routing, and MTP.
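
Assuming the returned pair is mlx_lm-compatible (which the bundled model class suggests), generation follows the usual mlx_lm flow. A sketch against a recent mlx_lm:

```python
from mlx_lm import generate

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize hybrid MLA + linear attention."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```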
## Reasoning + tools
The chat template defaults to detailed thinking off. To enable chain-of-thought, send the system message:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```
The model emits `<think>...</think>` blocks before its answer when thinking is on; runtimes binding to the `deepseek_r1` reasoning parser will extract them automatically. Tool calls follow the DeepSeek tool-call block format.
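
For runtimes without a reasoning parser, a minimal fallback that splits the think block from the answer (a plain regex approach; not part of the bundle):

```python
import re

def split_think(text: str):
    """Return (reasoning, answer); reasoning is None when thinking is off."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, text.strip()
```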
## Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Base model: inclusionAI (Ant Group's Bailing team)
- Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3