---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---

<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4

**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source), with
**stock 4-bit affine** quantization on inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class: no TurboQuant runtime, no sidecar required.

- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers of MLA + Lightning-Linear-Attention,
  256 experts top-8, MTP head, 131K context)
- **Quantization:** MXFP4. Every weight (routed experts, attention,
  shared experts, dense MLP, embed, lm_head) is quantized at **4-bit affine,
  group_size=32**. Norms, router gates, expert biases, and slopes stay
  fp16/fp32 passthrough.
- **Bundle size:** **63 GB on disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio

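For orientation, this recipe maps naturally onto `mlx_lm`'s converter. The sketch
below is illustrative only, assuming a recent `mlx-lm` whose `convert()` exposes the
`quant_predicate` hook; the `".gate"` path match used to spare the MoE routers is a
guess at the module naming, not taken from this bundle:

```python
from mlx_lm import convert

def keep_routers_high_precision(path, module, config):
    """Quantize everything except the MoE router gates.

    `path` is the dotted module path; returning False skips quantization for
    that module, True applies the global q_bits / q_group_size setting.
    The ".gate" (router) vs "gate_proj" (MLP projection) distinction is an
    assumption about module names, not verified against this checkpoint.
    """
    if path.endswith(".gate"):
        return False
    return True

convert(
    hf_path="inclusionAI/Ling-2.6-flash",
    mlx_path="Ling-2.6-flash-MXFP4",
    quantize=True,
    q_bits=4,          # 4-bit affine weights
    q_group_size=32,   # group size 32, as in this bundle
    quant_predicate=keep_routers_high_precision,
)
```
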
## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts the cheap-but-slow MXTQ codec on the routed experts in
exchange for a tighter overall bit budget. MXFP4 is the simpler
"just-works-with-stock-MLX" option for users who don't want the TurboQuant
runtime in their stack.

## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full softmax MLA, the other 28 of 32
are Lightning-Linear-Attention. Plus a Multi-Token Prediction head.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
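
The alternation is easy to express programmatically. A minimal sketch of how a
loader might classify each layer, assuming a simple modulo rule; the constant
names here are illustrative, not the actual `bailing_hybrid` config keys:

```python
# Hypothetical sketch of the layer pattern above; names are assumptions,
# not the real bailing_hybrid config schema.
NUM_LAYERS = 32
FULL_ATTN_INTERVAL = 8  # every 8th layer (7, 15, 23, 31) is full softmax MLA

def layer_kind(i: int) -> str:
    attn = "mla" if (i + 1) % FULL_ATTN_INTERVAL == 0 else "linear"
    mlp = "dense" if i == 0 else "moe"
    return f"{attn}+{mlp}"

print([layer_kind(i) for i in range(NUM_LAYERS)])
# layer 0 -> linear+dense, layers 7/15/23/31 -> mla+moe, the rest -> linear+moe
```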

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.

## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.
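
Generation then follows the standard `mlx_lm` pattern. A short usage sketch
reusing `model` and `tokenizer` from above; the prompt text and `max_tokens`
value are arbitrary choices, not recommendations from this card:

```python
# Build a chat-formatted prompt with the bundled tokenizer's template.
messages = [{"role": "user", "content": "Summarize MLA in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Generate a completion from the quantized model.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```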

## Reasoning + tools

Default is **`detailed thinking off`**. To enable, set the system prompt to
`detailed thinking on`:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before answers when
thinking is on. Tool calls use the DeepSeek-style format.
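
For a concrete run with thinking enabled, a minimal sketch reusing the loaded
`model`/`tokenizer`; splitting on the `<think>` tags with a regex is a display
convenience added here, not part of the bundle:

```python
import re
from mlx_lm import generate

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Why is the sky blue?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
raw = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Separate the <think>...</think> trace from the final answer for display.
match = re.search(r"<think>(.*?)</think>\s*(.*)", raw, flags=re.DOTALL)
thinking, answer = match.groups() if match else ("", raw)
print(answer)
```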

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first
  inference for open-weight LLMs.