---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- jangtq
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-JANGTQ
~103B-A8B hybrid MoE · 30 GB on disk (down from the 200 GB bf16 source) · mixed-precision JANGTQ2 quantization of inclusionAI's Bailing-V2.5 hybrid architecture. Runs on a 64 GB+ Apple Silicon Mac.
- Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- Quantization: JANGTQ2 · 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on every attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn `query_key_value`/`g_proj`/`dense`) + 8-bit affine on shared experts / layer-0 dense MLP / MTP `eh_proj` / embed / lm_head + fp16 passthrough on every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / slope (a dequant sketch follows this list)
- Bundle size: 30 GB on disk across 31 shards (24,576 routed-expert tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- Runs on: M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB
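
The MXTQ container layout isn't documented in this card, but the 8-bit affine side is standard group-wise quantization. A minimal NumPy sketch of one common convention (per-group scale and zero-point over 64-wide groups; the tensor layout here is assumed for illustration, not read from the bundle):

```python
import numpy as np

def dequant_affine8(q, scales, zeros, group_size=64):
    """Group-wise 8-bit affine dequant (layout assumed for illustration).

    q: (out, in) uint8 codes; scales, zeros: (out, in // group_size).
    Each column group of 64 shares one (scale, zero) pair.
    """
    out_f, in_f = q.shape
    codes = q.reshape(out_f, in_f // group_size, group_size).astype(np.float32)
    w = (codes - zeros[..., None]) * scales[..., None]
    return w.reshape(out_f, in_f)
```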
## Architecture (`bailing_hybrid`)
This is a hybrid attention model: every 8th layer is full-softmax MLA, while the other 28 layers (of 32) are Lightning-Linear-Attention (Gated Linear Attention with per-head learned slopes). A single Multi-Token Prediction head is appended for speculative decoding.
| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
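
A minimal sketch of the dispatch rule the table implies (every 8th decoder layer, counting from 1, is full-softmax MLA):

```python
def attention_kind(layer_idx: int) -> str:
    """Hybrid dispatch: layers 7, 15, 23, 31 (0-indexed) use MLA."""
    return "mla" if (layer_idx + 1) % 8 == 0 else "linear"

# Matches the table above.
assert [i for i in range(32) if attention_kind(i) == "mla"] == [7, 15, 23, 31]
```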
MLA: DeepSeek-V2-flavored · `q_lora_rank=1536`, `kv_lora_rank=512`, `qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M), `v_head_dim=128`. Materializes K and V (no absorb optimization).
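
For orientation, a shape-only trace of the low-rank Q/KV path for one token. `d_model` and `n_heads` are placeholders (not stated in this card); the ranks and head dims are the ones above:

```python
import numpy as np

d_model, n_heads = 4096, 32          # placeholders for illustration only
q_rank, kv_rank = 1536, 512
qk_nope, qk_rope, v_head = 128, 64, 128

x = np.random.randn(d_model)
# Q path: compress to q_lora_rank, then expand per head (nope + rope halves).
q_a = np.zeros((q_rank, d_model)) @ x
q   = np.zeros((n_heads * (qk_nope + qk_rope), q_rank)) @ q_a
# KV path: one compressed latent plus a single shared RoPE key slice.
kv_a = np.zeros((kv_rank + qk_rope, d_model)) @ x
kv   = np.zeros((n_heads * (qk_nope + v_head), kv_rank)) @ kv_a[:kv_rank]
# K and V are then materialized per head from kv (no absorb optimization),
# so attention runs as ordinary softmax over full K/V.
assert q.shape == (n_heads * (qk_nope + qk_rope),)
assert kv.shape == (n_heads * (qk_nope + v_head),)
```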
Linear attention: Lightning-Attention-2 / Gated Linear Attention · fused `query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by layer_idx, sigmoid-gated output through `g_proj` + GroupRMSNorm (groups=4).
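
In recurrent form the per-head computation reduces to a decayed state update. A minimal sketch of one decode step (the fused chunked kernel, gating, and norm plumbing are omitted):

```python
import numpy as np

def gla_step(S, q_t, k_t, v_t, slope):
    """One decode step of slope-decayed linear attention for a single head.

    S: (d_k, d_v) running state; q_t, k_t: (d_k,); v_t: (d_v,);
    slope: this head's ALiBi-style decay (scaled by layer_idx in the model).
    The output o_t then passes through the sigmoid(g_proj) gate + GroupRMSNorm.
    """
    S = np.exp(-slope) * S + np.outer(k_t, v_t)   # decay old state, add new k,v
    o_t = q_t @ S                                  # readout for this query
    return S, o_t
```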
MoE routing: sigmoid scoring + grouped routing (8 groups × top-4 groups → top-8 within), `e_score_correction_bias` for noaux_tc: selection uses biased scores, weighting uses the original sigmoid scores. Identical math to DeepSeek-V3 / Kimi-K2.6.
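
Since the card states the math matches DeepSeek-V3, here is a minimal NumPy sketch of that routing for one token (the top-2-per-group group score and the final weight normalization are the DeepSeek-V3 conventions, assumed to carry over):

```python
import numpy as np

def route(logits, e_bias, n_groups=8, topk_groups=4, top_k=8):
    """noaux_tc routing: select on biased scores, weight on raw sigmoid scores."""
    scores = 1.0 / (1.0 + np.exp(-logits))         # sigmoid scoring, (256,)
    biased = scores + e_bias                        # e_score_correction_bias
    group_size = logits.shape[0] // n_groups        # 256 / 8 = 32 experts/group
    # Score each group by its two best biased experts (DeepSeek-V3 convention).
    group_score = np.sort(biased.reshape(n_groups, group_size))[:, -2:].sum(-1)
    keep = np.argsort(group_score)[-topk_groups:]   # top-4 groups
    masked = np.full_like(biased, -np.inf)
    for g in keep:
        lo, hi = g * group_size, (g + 1) * group_size
        masked[lo:hi] = biased[lo:hi]
    idx = np.argsort(masked)[-top_k:]               # top-8 within kept groups
    w = scores[idx] / scores[idx].sum()             # weights from original scores
    return idx, w
```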
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | 2-bit MXTQ + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |
`jangtq_runtime.safetensors` sidecar (20.8 KB) for Swift runtimes · covers the (in_features={1024,4096}, seed=42, bits=2) codebooks + sign-flip vectors.
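
The sidecar is a plain safetensors file, so its contents can be inspected directly. A quick sketch (tensor names are whatever the file actually ships; none are assumed here):

```python
from safetensors import safe_open

with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)
```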
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools import load_jangtq

model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```
The bundle ships `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` for HF compatibility, plus an mlx_lm model class (`bailing_hybrid.py`) that handles the hybrid attention dispatch, MLA, GLA recurrence, MoE routing, and MTP.
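
Assuming the returned pair is mlx_lm-compatible (which the bundled model class suggests), generation follows the usual mlx_lm flow. A sketch against a recent mlx_lm:

```python
from mlx_lm import generate

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize hybrid MLA + linear attention."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```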
## Reasoning + tools
The chat template defaults to detailed thinking off. To enable chain-of-thought, send the system message:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```
The model emits `<think>...</think>` blocks before its answer when thinking is on; runtimes binding to the `deepseek_r1` reasoning parser will extract them automatically. Tool calls follow the DeepSeek tool-call block format.
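
For runtimes without a reasoning parser, a minimal fallback that splits the think block from the answer (a plain regex approach; not part of the bundle):

```python
import re

def split_think(text: str):
    """Return (reasoning, answer); reasoning is None when thinking is off."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, text.strip()
```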
## Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Base model: inclusionAI (Ant Group's Bailing team)
- Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3