Ling-2.6-flash-JANGTQ

~103B-A8B hybrid MoE — 30 GB on disk (down from the 200 GB bf16 source) — mixed-precision JANGTQ2 quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Runs on a 64 GB+ Apple Silicon Mac.

  • Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
  • Quantization: JANGTQ2 mixed precision (a Lloyd-Max sketch follows this list)
      – 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights
      – 8-bit affine on every attention path (MLA q_a/q_b/kv_a/kv_b/dense; Linear-Attn query_key_value/g_proj/dense)
      – 8-bit affine on shared experts, layer-0 dense MLP, MTP eh_proj, embed, lm_head
      – fp16 passthrough on every RMSNorm / GroupRMSNorm / router gate / expert_bias / slope
  • Bundle size: 30 GB on-disk across 31 shards (24,576 routed-expert tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
  • Runs on: M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB
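
As a rough picture of what "Lloyd-Max optimized" means for the 2-bit codebook path above: fit a 2^bits-level scalar quantizer to the (Hadamard-rotated) weight distribution by alternating nearest-codeword assignment and centroid updates. This is an illustrative 1-D sketch, not the jang-tools implementation:

```python
import numpy as np

def lloyd_max_codebook(w, bits=2, iters=50):
    # Illustrative Lloyd-Max fit: a 2**bits-level scalar quantizer
    # tuned to the empirical weight distribution (1-D k-means).
    x = np.asarray(w).ravel()
    levels = 2 ** bits
    # start codewords at evenly spaced quantiles of the data
    code = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # assignment: nearest codeword for every sample
        idx = np.abs(x[:, None] - code[None, :]).argmin(axis=1)
        # update: each codeword moves to the mean of its cell
        for k in range(levels):
            if np.any(idx == k):
                code[k] = x[idx == k].mean()
    return code, idx
```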

Architecture (bailing_hybrid)

This is a hybrid attention model: every 8th layer is full softmax MLA, and the other 28 layers (of 32) are Lightning-Linear-Attention (Gated Linear Attention with per-head learned slopes). A single Multi-Token Prediction (MTP) head sits on top for speculative decoding. The table below maps blocks to layers, with a dispatch sketch after it.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
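
In code, the layer map above reduces to a two-line predicate (my reading of the table, not code from the bundle):

```python
def layer_kind(i: int) -> tuple[str, str]:
    # Every 8th decoder layer (0-indexed 7, 15, 23, 31) is full-softmax
    # MLA; the rest are Lightning linear attention. Only layer 0 keeps
    # a dense MLP; everything else (and the MTP head) is MoE (256+1).
    attn = "mla" if (i + 1) % 8 == 0 else "linear"
    mlp = "dense" if i == 0 else "moe"
    return attn, mlp
```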

MLA: DeepSeek-V2-flavored — q_lora_rank=1536, kv_lora_rank=512, qk_nope_head_dim=128, qk_rope_head_dim=64 (interleaved, theta=6M), v_head_dim=128. Materializes K and V (no absorb optimization).
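
Shape bookkeeping for that MLA path, using the dims above (hidden size and head count are not stated on this card, so those values are placeholders):

```python
import numpy as np

hidden, n_heads = 4096, 32                  # placeholders, NOT from the card
q_lora, kv_lora = 1536, 512                 # q_lora_rank, kv_lora_rank
d_nope, d_rope, d_v = 128, 64, 128          # per-head dims from the card

rng = np.random.default_rng(0)
x = rng.standard_normal((1, hidden))

# Q: down-project to rank 1536 (q_a), RMSNorm elided, up-project (q_b)
q = x @ rng.standard_normal((hidden, q_lora)) \
      @ rng.standard_normal((q_lora, n_heads * (d_nope + d_rope)))

# KV: a shared rank-512 latent plus a 64-dim RoPE carrier (kv_a), then
# kv_b materializes full per-head K_nope and V (no absorb optimization)
c_kv, k_rope = np.split(x @ rng.standard_normal((hidden, kv_lora + d_rope)),
                        [kv_lora], axis=-1)
k_nope_v = c_kv @ rng.standard_normal((kv_lora, n_heads * (d_nope + d_v)))
```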

Linear attention: Lightning-Attention-2 / Gated Linear Attention — fused query_key_value projection, per-head RMSNorm on Q/K, partial RoPE (50% of head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by layer_idx, sigmoid-gated output through g_proj + GroupRMSNorm (groups=4).
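
The recurrence behind that, stripped to a single token step (the decay parameterization from the per-head slope is my assumption; the real Lightning-Attention-2 kernel processes chunks, and the output then goes through GroupRMSNorm and the sigmoid gate from g_proj):

```python
import numpy as np

def gla_step(S, q, k, v, decay):
    # One recurrent step of linear attention with per-head decay:
    #   S_t = decay * S_{t-1} + k_t^T v_t ;   o_t = q_t S_t
    # S: (heads, d_k, d_v), q/k: (heads, d_k), v: (heads, d_v),
    # decay: (heads, 1, 1), e.g. exp(-slope_h) with the slope
    # scaled by layer_idx as the card describes.
    S = decay * S + k[:, :, None] * v[:, None, :]
    o = np.einsum("hk,hkv->hv", q, S)
    return S, o
```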

MoE routing: sigmoid scoring + grouped routing (8 groups × top-4 groups → top-8 within), with e_score_correction_bias for noaux_tc — selection uses the biased scores, weighting uses the original sigmoid scores (sketched below). Identical math to DeepSeek-V3 / Kimi-K2.6.
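
Concretely, routing one token looks like this (the sum-of-top-2 group score is the DeepSeek-V3 convention the card points to; the routed scaling factor is omitted):

```python
import numpy as np

def route(logits, bias, n_group=8, topk_group=4, top_k=8):
    # Select on bias-corrected scores, weight on the raw sigmoid scores.
    scores = 1.0 / (1.0 + np.exp(-logits))            # (256,)
    biased = scores + bias                            # e_score_correction_bias
    group_score = np.sort(biased.reshape(n_group, -1), axis=-1)[:, -2:].sum(-1)
    keep = np.argsort(group_score)[-topk_group:]      # best 4 of 8 groups
    mask = np.full((n_group, biased.size // n_group), -np.inf)
    mask[keep] = 0.0                                  # drop the other groups
    top = np.argsort(biased + mask.ravel())[-top_k:]  # top-8 experts overall
    return top, scores[top] / scores[top].sum()       # normalized weights
```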

What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | 2-bit MXTQ + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers: q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers: query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP eh_proj | bf16 | 8-bit affine g=64 |
| embed_tokens / lm_head | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router gate.weight | bf16 | fp16 passthrough |
| expert_bias (router score correction) | fp32 | fp32 passthrough |
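
For reference, "8-bit affine g=64" means each group of 64 consecutive weights gets its own scale and zero point. A minimal sketch (the actual jang-tools packing may differ):

```python
import numpy as np

def affine_quant(w, bits=8, g=64):
    # Per-group asymmetric quantization: q = round((w - min) / scale).
    grp = w.reshape(-1, g)
    lo, hi = grp.min(-1, keepdims=True), grp.max(-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2 ** bits - 1), 1.0)
    q = np.round((grp - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequant(q, scale, lo):
    return q.astype(np.float32) * scale + lo   # reshape back as needed
```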

A jangtq_runtime.safetensors sidecar (20.8 KB) ships for Swift runtimes — it carries the (in_features={1024,4096}, seed=42, bits=2) codebooks plus the sign-flip vectors.
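
The sidecar opens like any safetensors file (the key naming inside it isn't documented on this card, so this just enumerates the contents):

```python
from safetensors import safe_open

with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
    for key in f.keys():                     # codebooks + sign-flip vectors
        t = f.get_tensor(key)
        print(key, t.shape, t.dtype)
```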

Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools import load_jangtq

model, tokenizer = load_jangtq("JANGQ-AI/Ling-2.6-flash-JANGTQ")
```
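
Assuming the returned pair is mlx_lm-compatible (the bundle ships a bailing_hybrid model class, per below), generation should go through the stock mlx_lm API:

```python
from mlx_lm import generate

# Assumes load_jangtq returns an mlx_lm-compatible (model, tokenizer) pair.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize MLA in two sentences."}],
    add_generation_prompt=True, tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```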

The bundle ships configuration_bailing_moe_v2_5.py and modeling_bailing_moe_v2_5.py for HF compatibility, plus an mlx_lm model class (bailing_hybrid.py) that handles the hybrid attention dispatch, MLA, GLA recurrence, MoE routing, and MTP.

Reasoning + tools

The chat template defaults to detailed thinking off. To enable chain-of-thought:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]
```

The model emits <think>...</think> blocks before its answer when thinking is on; runtimes binding to the deepseek_r1 reasoning parser will extract them automatically. Tool calls follow the DeepSeek tool-call block format.
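
If your runtime doesn't bind a reasoning parser, a minimal extraction looks like:

```python
import re

def split_think(text: str) -> tuple[str, str]:
    # Separate the leading <think>...</think> block from the answer.
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    return (m.group(1).strip(), m.group(2)) if m else ("", text)
```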

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
  • Base model: inclusionAI — Ant Group's Bailing team
  • Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3