---
title: TIDE-dllm
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: true
tags:
- arxiv:2604.26951
---

<div align="center">
  <img src="https://huggingface.co/spaces/TIDE-dllm/README/resolve/main/logo.gif" alt="TIDE logo" width="320" />
</div>

<h1 align="center">Turning the TIDE</h1>

<p align="center"><em>Cross-Architecture Distillation for Diffusion Large Language Models</em></p>

<p align="center">The first cross-architecture distillation framework for diffusion LLMs: distilling 8B dense and 16B MoE teachers into a 0.6B student</p>

<p align="center">
  <a href="https://arxiv.org/abs/2604.26951"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2604.26951-b31b1b.svg?logo=arxiv" /></a>
  <a href="https://huggingface.co/papers/2604.26951"><img alt="HF Paper" src="https://img.shields.io/badge/%F0%9F%A4%97-Paper-blue" /></a>
  <a href="https://github.com/PKU-YuanGroup/TIDE"><img alt="Code" src="https://img.shields.io/badge/Code-PKU--YuanGroup%2FTIDE-181717.svg?logo=github" /></a>
  <a href="https://pku-yuangroup.github.io/TIDE-Page/"><img alt="Project Page" src="https://img.shields.io/badge/Project-Page-2ea44f" /></a>
  <a href="https://opensource.org/licenses/Apache-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" /></a>
</p>

---

This organization hosts the **distilled student checkpoints** and **pre-tokenized SFT datasets** released with TIDE. The framework consists of three modular components: **TIDAL** (dual-axis interpolation), **CompDemo** (complementary mask-split teacher inference), and **Reverse CALM** (cross-tokenizer chunk-level matching). It is evaluated across two heterogeneous distillation pipelines.
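
The component details are defined in the paper and the GitHub repo. As a loose illustration of what "complementary mask-split teacher inference" refers to, the sketch below splits the masked positions of one training example into two complementary halves so that each teacher pass sees the other half as clean context. The function name, the random 50/50 split, and the placeholder mask id are assumptions for illustration, not the released implementation.

```python
import torch

def complementary_mask_split(clean_ids, masked_pos, mask_token_id=0, generator=None):
    """Build two teacher views of one example: each view masks one half of the
    student's masked positions and leaves the other half visible as context.
    (Illustrative sketch only; mask_token_id=0 is a placeholder.)"""
    idx = masked_pos.nonzero(as_tuple=True)[0]
    perm = idx[torch.randperm(idx.numel(), generator=generator)]
    half_a, half_b = perm[: idx.numel() // 2], perm[idx.numel() // 2:]

    view_a, view_b = clean_ids.clone(), clean_ids.clone()
    view_a[half_a] = mask_token_id  # teacher pass 1: subset A hidden, subset B visible
    view_b[half_b] = mask_token_id  # teacher pass 2: subset B hidden, subset A visible
    return view_a, view_b

# Toy example: an 8-token sequence whose positions 2-5 are masked for the student.
clean = torch.arange(8)
masked = torch.zeros(8, dtype=torch.bool)
masked[2:6] = True
view_a, view_b = complementary_mask_split(clean, masked)
```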

## ✨ Highlights

- **+1.53 average gain** over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
- **+16.48 on HumanEval** over the equivalent-size AR baseline (48.78 vs. 32.30); distilled dLLMs especially excel at code generation.
- **22× peak-memory reduction** vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and **5.2× faster inference** (6.25 s vs. 32.55 s for 256 tokens on H100); see the back-of-envelope check below.
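
As a rough, weights-only sanity check on those memory numbers (an illustration, not how peak memory was measured in the paper): bf16 stores two bytes per parameter, so the student and teacher footprints land close to the reported peaks even before activations are counted.

```python
BYTES_PER_PARAM_BF16 = 2  # one bf16 weight = 2 bytes

student_gb = 0.6e9 * BYTES_PER_PARAM_BF16 / 1e9   # ~1.2 GB of weights vs. 1.4 GB peak reported
teacher_gb = 16e9 * BYTES_PER_PARAM_BF16 / 1e9    # ~32 GB of weights vs. 31.3 GB peak reported
print(f"student ~{student_gb:.1f} GB, teacher ~{teacher_gb:.1f} GB")
```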

## 🤗 Released models

Six 0.6B distilled student checkpoints (3 per pipeline). Each is initialized from [`dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1`](https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1) and distilled from a larger dLLM teacher.

| Pipeline | Variant | Repo |
|---|---|---|
| A – Cross-Tokenizer (LLaDA2 teacher) | **TIDE-Cross** *(native, paper-best)* | [`distill-LLaDA2-TIDE_Cross`](https://huggingface.co/TIDE-dllm/distill-LLaDA2-TIDE_Cross) |
| A – Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | [`distill-LLaDA2-TIDE_Shared`](https://huggingface.co/TIDE-dllm/distill-LLaDA2-TIDE_Shared) |
| A – Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | [`distill-LLaDA2-CALM`](https://huggingface.co/TIDE-dllm/distill-LLaDA2-CALM) |
| B – Shared-Tokenizer (WeDLM teacher) | **TIDE-Shared** *(native, paper-best)* | [`distill-WeDLM-TIDE_Shared`](https://huggingface.co/TIDE-dllm/distill-WeDLM-TIDE_Shared) |
| B – Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | [`distill-WeDLM-TIDE_Cross`](https://huggingface.co/TIDE-dllm/distill-WeDLM-TIDE_Cross) |
| B – Shared-Tokenizer (WeDLM teacher) | KL baseline | [`distill-WeDLM-KL`](https://huggingface.co/TIDE-dllm/distill-WeDLM-KL) |

## 📚 Released datasets

Pre-tokenized SFT mixtures (`tulu-3-sft-mixture` + `smoltalk` + `opc-sft-stage1` + `opc-sft-stage2`) prepared for each teacher, so distillation jobs never re-tokenize at startup.

| Pipeline | Repo |
|---|---|
| A – for the LLaDA2 teacher | [`distill_llada2_sft`](https://huggingface.co/datasets/TIDE-dllm/distill_llada2_sft) |
| B – for the WeDLM teacher | [`distill_wedlm_sft`](https://huggingface.co/datasets/TIDE-dllm/distill_wedlm_sft) |
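
A minimal way to pull one of these mixtures with the `datasets` library is sketched below; the split and column names are not verified here, so inspect the returned object before training.

```python
from datasets import load_dataset

# Pipeline-A mixture, pre-tokenized for the LLaDA2 teacher.
ds = load_dataset("TIDE-dllm/distill_llada2_sft")
print(ds)                    # available splits and their columns
first_split = next(iter(ds))
print(ds[first_split][0])    # peek at one pre-tokenized example
```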

## 🚀 Quick start

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "TIDE-dllm/distill-LLaDA2-TIDE_Cross"  # paper-best Pipeline-A checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForMaskedLM.from_pretrained(
    repo, dtype=torch.bfloat16, trust_remote_code=True,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```

The same `generate()` routine published with [`dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1`](https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1) works on every TIDE checkpoint; just swap the model name.
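
To confirm a checkpoint loads correctly before wiring up that routine, a plain forward pass with the standard Transformers API is enough. Continuing from the snippet above, and assuming the remote-code model returns ordinary masked-LM `logits` (a smoke test, not diffusion decoding):

```python
# Smoke test: one forward pass through the loaded student.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)
print(logits.shape)
```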

## 📝 Citation

```bibtex
@misc{zhang2026turningtidecrossarchitecturedistillation,
  title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
  author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
  year={2026},
  eprint={2604.26951},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.26951},
}
```