---
license: mit
tags:
  - moe
  - mixture-of-experts
  - hybrid-attention
  - mla
  - lightning-attention
  - mxfp4
  - osaurus
  - mlx
  - bailing
  - ling
  - apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

# Ling-2.6-flash-MXFP4

~103B-A8B hybrid MoE — 63 GB on disk (down from the 200 GB bf16 source) — stock 4-bit affine quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model class — no TurboQuant runtime, no sidecar required.

  • Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
  • Quantization: MXFP4 β€” every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at 4-bit affine group_size=32. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough.
  • Bundle size: 63 GB on-disk across 51 shards
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio

## Why two variants?

|  | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack.
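For reference, the 4-bit affine scheme used here can be sketched in plain Python (illustrative only; MLX's real quantized kernels pack codes into integers and run vectorized on the GPU):

```python
def quantize_group(w, bits=4):
    # Affine (asymmetric) quantization of one group of weights:
    # map [min, max] onto the integer grid [0, 2**bits - 1].
    lo, hi = min(w), max(w)
    scale = (hi - lo) / (2**bits - 1) or 1.0  # guard against a constant group
    codes = [round((x - lo) / scale) for x in w]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]

w = [0.1 * i - 0.8 for i in range(32)]        # one group_size=32 slice
codes, scale, lo = quantize_group(w)
w_hat = dequantize_group(codes, scale, lo)
err = max(abs(a - b) for a, b in zip(w, w_hat))
# Round-trip error is bounded by half a quantization step (scale / 2).
```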

## Architecture (`bailing_hybrid`)

Hybrid attention — every 8th layer is full-softmax MLA; the other 28 of 32 are Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on top.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
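The schedule in the table reduces to a simple rule on the layer index (an illustrative helper, not code from the model):

```python
def attn_type(layer: int) -> str:
    # Layers 7, 15, 23, 31 (every 8th layer, offset 7) are full-softmax MLA;
    # everything else in the 32-layer stack is Lightning linear attention.
    return "mla" if layer % 8 == 7 else "linear"

mla_layers = [i for i in range(32) if attn_type(i) == "mla"]
print(mla_layers)  # [7, 15, 23, 31]
```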

See the JANGTQ variant card for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is present (shipped with jang-tools >= TBD). The bundle's `configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py` provide HF compatibility for tooling that goes through `transformers`.

## Reasoning + tools

Default is detailed thinking off. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]
```

With thinking on, the model emits `<think>...</think>` reasoning blocks before the answer. Tool calls use the DeepSeek-style format.
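If you only want to display the final answer, the reasoning block can be stripped with a small helper (an illustrative snippet; `strip_think` is not part of mlx-lm):

```python
import re

def strip_think(text: str) -> str:
    # Drop <think>...</think> reasoning blocks (DOTALL so multi-line
    # reasoning is matched) and trim surrounding whitespace.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```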

## Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Base model: inclusionAI β€” Ant Group's Bailing team
  • Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
  • Osaurus: osaurus.ai β€” Apple-Silicon-first inference for open-weight LLMs.