---
license: mit
tags:
  - moe
  - mixture-of-experts
  - hybrid-attention
  - mla
  - lightning-attention
  - mxfp4
  - osaurus
  - mlx
  - bailing
  - ling
  - apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-MXFP4

~103B-A8B hybrid MoE — 63 GB on disk (down from the 200 GB bf16 source) — stock 4-bit affine quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model class — no TurboQuant runtime, no sidecar required.

  • Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
  • Quantization: MXFP4 β€” every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at 4-bit affine group_size=32. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough.
  • Bundle size: 63 GB on-disk across 51 shards
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio

## Why two variants?

|  | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack.
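For reference, the 4-bit affine scheme used here can be sketched in plain Python (illustrative only; MLX's real quantized kernels pack codes into integers and run vectorized on the GPU):

```python
def quantize_group(w, bits=4):
    # Affine (asymmetric) quantization of one group of weights:
    # map [min, max] onto the integer grid [0, 2**bits - 1].
    lo, hi = min(w), max(w)
    scale = (hi - lo) / (2**bits - 1) or 1.0  # guard against a constant group
    codes = [round((x - lo) / scale) for x in w]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]

w = [0.1 * i - 0.8 for i in range(32)]        # one group_size=32 slice
codes, scale, lo = quantize_group(w)
w_hat = dequantize_group(codes, scale, lo)
err = max(abs(a - b) for a, b in zip(w, w_hat))
# Round-trip error is bounded by half a quantization step (scale / 2).
```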

## Architecture (`bailing_hybrid`)

Hybrid attention — every 8th layer is full-softmax MLA; the other 28 of 32 are Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
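The schedule in the table reduces to a simple rule on the layer index (an illustrative helper, not code from the model):

```python
def attn_type(layer: int) -> str:
    # Layers 7, 15, 23, 31 (every 8th layer, offset 7) are full-softmax MLA;
    # everything else in the 32-layer stack is Lightning linear attention.
    return "mla" if layer % 8 == 7 else "linear"

mla_layers = [i for i in range(32) if attn_type(i) == "mla"]
print(mla_layers)  # [7, 15, 23, 31]
```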

See the JANGTQ variant card for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is present (shipped with jang-tools >= TBD). The bundle's `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` provide HF compatibility for tooling that goes through `transformers`.

## Reasoning + tools

Default is detailed thinking off. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]
```

With thinking on, the model emits `<think>...</think>` reasoning blocks before the answer. Tool calls use the DeepSeek-style format.
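If you only want to display the final answer, the reasoning block can be stripped with a small helper (an illustrative snippet; `strip_think` is not part of mlx-lm):

```python
import re

def strip_think(text: str) -> str:
    # Drop <think>...</think> reasoning blocks (DOTALL so multi-line
    # reasoning is matched) and trim surrounding whitespace.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```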

## Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Base model: inclusionAI β€” Ant Group's Bailing team
  • Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
  • Osaurus: osaurus.ai β€” Apple-Silicon-first inference for open-weight LLMs.