---
title: TIDE-dllm
emoji: 🌊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: true
tags:
  - arxiv:2604.26951
---
# Cross-Architecture Distillation for Diffusion Large Language Models
🌊 The first cross-architecture distillation framework for diffusion LLMs — distilling 8B dense and 16B MoE teachers into a 0.6B student 🌊
---

This organization hosts the **distilled student checkpoints** and **pre-tokenized SFT datasets** released with TIDE. The framework consists of three modular components — **TIDAL** (dual-axis interpolation), **CompDemo** (complementary mask-split teacher inference), and **Reverse CALM** (cross-tokenizer chunk-level matching) — and is evaluated across two heterogeneous distillation pipelines.

## ✨ Highlights

- **+1.53 average gain** over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
- **+16.48 on HumanEval** over the equivalent-size AR baseline (48.78 vs. 32.30) — distilled dLLMs especially excel at code generation.
- **22× peak-memory reduction** vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and **5.2× faster inference** (6.25 s vs. 32.55 s for 256 tokens on H100).

## 🤖 Released models

Six 0.6B distilled student checkpoints (3 per pipeline). Each is initialized from [`dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1`](https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1) and distilled from a larger dLLM teacher.

| Pipeline | Variant | Repo |
|---|---|---|
| A — Cross-Tokenizer (LLaDA2 teacher) | **TIDE-Cross** *(native, paper-best)* | [`distill-LLaDA2-TIDE_Cross`](https://huggingface.co/TIDE-dllm/distill-LLaDA2-TIDE_Cross) |
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | [`distill-LLaDA2-TIDE_Shared`](https://huggingface.co/TIDE-dllm/distill-LLaDA2-TIDE_Shared) |
| A — Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | [`distill-LLaDA2-CALM`](https://huggingface.co/TIDE-dllm/distill-LLaDA2-CALM) |
| B — Shared-Tokenizer (WeDLM teacher) | **TIDE-Shared** *(native, paper-best)* | [`distill-WeDLM-TIDE_Shared`](https://huggingface.co/TIDE-dllm/distill-WeDLM-TIDE_Shared) |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | [`distill-WeDLM-TIDE_Cross`](https://huggingface.co/TIDE-dllm/distill-WeDLM-TIDE_Cross) |
| B — Shared-Tokenizer (WeDLM teacher) | KL baseline | [`distill-WeDLM-KL`](https://huggingface.co/TIDE-dllm/distill-WeDLM-KL) |

## 📚 Released datasets

Pre-tokenized SFT mixtures (`tulu-3-sft-mixture` + `smoltalk` + `opc-sft-stage1` + `opc-sft-stage2`) prepared for each teacher, so distillation jobs never re-tokenize at startup. A loading sketch appears at the end of this page.

| Pipeline | Repo |
|---|---|
| A — for the LLaDA2 teacher | [`distill_llada2_sft`](https://huggingface.co/datasets/TIDE-dllm/distill_llada2_sft) |
| B — for the WeDLM teacher | [`distill_wedlm_sft`](https://huggingface.co/datasets/TIDE-dllm/distill_wedlm_sft) |

## 🚀 Quick start

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "TIDE-dllm/distill-LLaDA2-TIDE_Cross"  # paper-best Pipeline-A checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForMaskedLM.from_pretrained(
    repo,
    dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```

The same `generate()` routine published with [`dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1`](https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1) works on every TIDE checkpoint — just swap the model name. A hedged end-to-end sketch appears at the end of this page.

## 📝 Citation

```bibtex
@misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
}
```
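## 📦 Dataset loading sketch

A minimal sketch of pulling one row from the Pipeline-A dataset with the 🤗 `datasets` library. The split name (`train`) is an assumption; check the dataset card for the actual splits and column schema.

```python
from datasets import load_dataset

# Loading sketch for the Pipeline-A mixture. The "train" split name is an
# assumption, not confirmed by the dataset card.
ds = load_dataset("TIDE-dllm/distill_llada2_sft", split="train", streaming=True)

# Rows are pre-tokenized, so they should already carry token ids rather than
# raw text; inspecting the keys shows the exact schema.
first = next(iter(ds))
print(sorted(first.keys()))
```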
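## 🧪 Generation sketch

A hedged end-to-end sketch continuing the Quick start snippet above. The diffusion `generate()` routine is supplied by the remote code of the base BD3LM repo, so its exact arguments may differ; `max_new_tokens` below is an assumption, and prompt formatting via `apply_chat_template` presumes a chat template is bundled with the tokenizer.

```python
# Continues from the Quick start: `model`, `tokenizer`, and `device` are already set up.
prompt = "Write a Python function that reverses a string."
messages = [{"role": "user", "content": prompt}]

# Assumes the tokenizer ships a chat template; fall back to plain encoding otherwise.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    # Argument names are assumptions; consult the base repo's generate() docs
    # for the diffusion-specific options (e.g. number of denoising steps).
    output_ids = model.generate(input_ids, max_new_tokens=256)

# Decode only the newly generated tokens, assuming the prompt is returned as a prefix.
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True))
```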