---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- jangtq
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---
<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>
# Ling-2.6-flash-JANGTQ
**~103B-A8B hybrid MoE, 30 GB on disk** (down from the 200 GB bf16 source):
mixed-precision **JANGTQ2** quantization of inclusionAI's Bailing-V2.5 hybrid
architecture. Runs on a 64 GB+ Apple Silicon Mac.
- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers mixing MLA and
  Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- **Quantization:** JANGTQ2: 2-bit MXTQ codebook (Hadamard-rotated,
  Lloyd-Max optimized) on routed-expert weights + 8-bit affine on every
  attention path (MLA `q_a`/`q_b`/`kv_a`/`kv_b`/`dense`, Linear-Attn
  `query_key_value`/`g_proj`/`dense`) + 8-bit affine on shared experts /
  layer-0 dense MLP / MTP `eh_proj` / embed / lm_head + fp16 passthrough on
  every RMSNorm / GroupRMSNorm / router gate / `expert_bias` / `slope`
- **Bundle size:** **30 GB on disk** across 31 shards (24,576 routed-expert
  tensors at 2-bit + 211 affine 8-bit + 228 passthrough)
- **Runs on:** M2 Pro 64 GB / M3 Max 64 GB+ / M4 Max 128 GB / M5 Max 128 GB
## Architecture (`bailing_hybrid`)
This is a **hybrid attention** model: every 8th layer is full-softmax MLA,
while the other 28 layers (of 32) are Lightning-Linear-Attention (Gated Linear
Attention with per-head learned slopes). A single Multi-Token Prediction head
is attached for speculative decoding.
| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
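The layer schedule in the table above can be reproduced with a one-liner
(layer indices are 0-based, so layers 7, 15, 23, and 31 are the full-softmax
MLA layers):

```python
# Recover the hybrid attention schedule for the 32 decoder layers.
# Every 8th layer (0-based index i with (i + 1) % 8 == 0) is full-softmax MLA;
# all other layers use Lightning / Gated Linear Attention (GLA).
layer_types = ["MLA" if (i + 1) % 8 == 0 else "GLA" for i in range(32)]

mla_layers = [i for i, t in enumerate(layer_types) if t == "MLA"]
print(mla_layers)                 # [7, 15, 23, 31]
print(layer_types.count("GLA"))   # 28
```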
**MLA:** DeepSeek-V2-flavored: `q_lora_rank=1536`, `kv_lora_rank=512`,
`qk_nope_head_dim=128`, `qk_rope_head_dim=64` (interleaved, theta=6M),
`v_head_dim=128`. Materializes K and V (no absorb optimization).
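As a shape check, the MLA projections decompose as below. The card does not
state the hidden size or head count, so `hidden_size` and `n_heads` are
illustrative placeholders; only the low-rank and head dims come from the
config above.

```python
# Illustrative MLA projection shapes. hidden_size and n_heads are
# hypothetical; the low-rank / head dims are from the model config.
hidden_size, n_heads = 4096, 32          # placeholders for illustration
q_lora_rank, kv_lora_rank = 1536, 512
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128

qk_head_dim = qk_nope_head_dim + qk_rope_head_dim   # 192 per query/key head
shapes = {
    "q_a":  (hidden_size, q_lora_rank),                       # Q down-proj
    "q_b":  (q_lora_rank, n_heads * qk_head_dim),             # Q up-proj
    "kv_a": (hidden_size, kv_lora_rank + qk_rope_head_dim),   # latent KV + rope-K
    "kv_b": (kv_lora_rank, n_heads * (qk_nope_head_dim + v_head_dim)),
}
print(qk_head_dim)   # 192
```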
**Linear attention:** Lightning-Attention-2 / Gated Linear Attention: fused
`query_key_value` projection, per-head RMSNorm on Q/K, partial RoPE (50% of
head_dim, adjacent-half rotation), per-head ALiBi-style slope scaled by
layer_idx, sigmoid-gated output through `g_proj` + `GroupRMSNorm` (groups=4).
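A minimal numpy sketch of the linear-attention recurrence described above.
The shapes, the toy `slope` value, and the exact decay parameterization are
assumptions for illustration; the real kernel is chunked and fused, and the
output then passes through the sigmoid gate and `GroupRMSNorm`.

```python
import numpy as np

# Toy single-head GLA recurrence: the state S accumulates outer(k, v) with a
# per-head ALiBi-style decay; the output reads S with the query.
T, d_k, d_v = 6, 8, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for d in (d_k, d_k, d_v))

slope = 0.9              # per-head slope (assumed value for illustration)
decay = np.exp(-slope)   # illustrative per-step decay factor

S = np.zeros((d_k, d_v))
outs = []
for t in range(T):
    S = decay * S + np.outer(k[t], v[t])   # linear-attention state update
    outs.append(q[t] @ S)                  # read out with the query
out = np.stack(outs)                       # (T, d_v); gate + norm come after
```

This recurrence is equivalent to decay-weighted causal attention:
`out[t] = sum_{s<=t} decay**(t-s) * (q[t]·k[s]) * v[s]`, which is why it runs
in O(T) time with O(d_k * d_v) state per head.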
**MoE routing:** sigmoid scoring + grouped routing (8 groups, top-4 groups
selected, then top-8 experts within them), `e_score_correction_bias` for
noaux_tc: selection uses the biased scores, weighting uses the original
sigmoid scores. Identical math to DeepSeek-V3 / Kimi-K2.6.
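The routing described above can be sketched in numpy. The expert count and
group layout come from the card; the logits are random for illustration, and
the group score here uses the DeepSeek-V3 top-2-sum convention, which the card
implies (via "identical math") but does not spell out.

```python
import numpy as np

n_experts, n_groups, topk_groups, top_k = 256, 8, 4, 8
group_size = n_experts // n_groups       # 32 experts per group
rng = np.random.default_rng(0)
logits = rng.standard_normal(n_experts)
bias = rng.standard_normal(n_experts) * 0.01  # e_score_correction_bias

scores = 1.0 / (1.0 + np.exp(-logits))   # sigmoid scoring
biased = scores + bias                   # noaux_tc: bias affects selection only

# Score each group by the sum of its top-2 biased scores, keep the best
# 4 groups, then take the top-8 experts within the surviving groups.
g = biased.reshape(n_groups, group_size)
group_scores = np.sort(g, axis=1)[:, -2:].sum(axis=1)
keep = np.argsort(group_scores)[-topk_groups:]
mask = np.full(n_experts, -np.inf)
for gi in keep:
    mask[gi * group_size:(gi + 1) * group_size] = 0.0
experts = np.argsort(biased + mask)[-top_k:]

weights = scores[experts]                # weighting uses the ORIGINAL scores
weights /= weights.sum()                 # normalized routing weights
```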
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 32 MoE layers = 24,576 tensors) | bf16 | **2-bit MXTQ** + sidecar |
| Shared expert (per MoE layer) | bf16 | 8-bit affine g=64 |
| MLA attention (5 layers, q/q_a/q_b/kv_a/kv_b/dense) | bf16 | 8-bit affine g=64 |
| Linear attention (28 layers, query_key_value/dense/g_proj) | bf16 | 8-bit affine g=64 |
| Dense MLP (layer 0) | bf16 | 8-bit affine g=64 |
| MTP `eh_proj` | bf16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | bf16 | 8-bit affine g=64 |
| RMSNorm / GroupRMSNorm / router `gate.weight` | bf16 | fp16 passthrough |
| `expert_bias` (router score correction) | fp32 | fp32 passthrough |
`jangtq_runtime.safetensors` sidecar (20.8 KB) for Swift runtimes: covers
`(in_features={1024,4096}, seed=42, bits=2)` codebooks + sign-flip vectors.
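For orientation, a 2-bit codebook dequantization reduces to a lookup plus a
sign flip and a scale. The sketch below is purely illustrative: the real MXTQ
codebook values, the Hadamard rotation, and the sidecar layout are not
documented on this card, so every value here is a placeholder.

```python
import numpy as np

# Hypothetical 2-bit dequant: bits=2 gives a 4-entry codebook.
# Real JANGTQ2 codebooks are Lloyd-Max optimized over Hadamard-rotated
# weights; the values and layout below are placeholders only.
codebook = np.array([-1.2, -0.3, 0.3, 1.2], dtype=np.float32)  # assumed levels
indices = np.array([0, 3, 1, 2, 2, 0], dtype=np.uint8)         # 2-bit codes
signs = np.array([1, -1, 1, 1, -1, 1], dtype=np.float32)       # sign-flip vector
scale = 0.05                                                    # per-group scale

w = scale * signs * codebook[indices]   # rotated-space weights; the runtime
                                        # then applies the inverse Hadamard
```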
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools import load_jangtq
model, tokenizer = load_jangtq("OsaurusAI/Ling-2.6-flash-JANGTQ")
```
The bundle ships `configuration_bailing_moe_v2_5.py` and
`modeling_bailing_moe_v2_5.py` for HF compatibility, plus an `mlx_lm`
model class (`bailing_hybrid.py`) that handles the hybrid attention
dispatch, MLA, GLA recurrence, MoE routing, and MTP.
## Reasoning + tools
The chat template defaults to **`detailed thinking off`**. To enable
chain-of-thought:
```python
messages = [
{"role": "system", "content": "detailed thinking on"},
{"role": "user", "content": "..."},
]
```
The model emits `<think>...</think>` blocks before its answer when thinking
is on; runtimes binding to the `deepseek_r1` reasoning parser will extract
them automatically. Tool calls follow the DeepSeek tool-call block format.
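If your runtime does not bind a reasoning parser, the `<think>` block can be
split off manually. A minimal sketch; when thinking is off the model may emit
no block at all, which the fallback branch handles.

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a '<think>...</think>answer' string."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m is None:
        return "", text          # no thinking block emitted
    return m.group(1).strip(), m.group(2).strip()

reasoning, answer = split_thinking("<think>2+2=4</think>The answer is 4.")
print(answer)   # The answer is 4.
```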
## Credits
- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3