N2048M committed on
Commit a3c5430 · verified · 1 Parent(s): dc6a38a

Remove index.html so org card renders README.md (matches Qwen/meta-llama/google convention)

Files changed (1)
  1. index.html +0 -86
index.html DELETED
@@ -1,86 +0,0 @@
- <div align="center">
-
- <img src="https://huggingface.co/spaces/TIDE-dllm/README/resolve/main/logo.gif" alt="TIDE logo" width="320" />
-
- <h1>Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models</h1>
-
- <p>🌊 The first cross-architecture distillation framework for diffusion LLMs, distilling 8B dense and 16B MoE teachers into a 0.6B student 🌊</p>
-
- <p>
- <a href="https://arxiv.org/abs/2604.26951"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2604.26951-b31b1b.svg?logo=arxiv" /></a>
- <a href="https://github.com/PKU-YuanGroup/TIDE"><img alt="Code" src="https://img.shields.io/badge/Code-PKU--YuanGroup%2FTIDE-181717.svg?logo=github" /></a>
- <a href="https://pku-yuangroup.github.io/TIDE-Page/"><img alt="Project Page" src="https://img.shields.io/badge/Project-Page-2ea44f" /></a>
- <a href="https://huggingface.co/papers/2604.26951"><img alt="HF Paper" src="https://img.shields.io/badge/%F0%9F%A4%97-Paper-blue" /></a>
- </p>
-
- </div>
-
- <p>This organization hosts the <strong>distilled student checkpoints</strong> and <strong>pre-tokenized SFT datasets</strong> released with TIDE. The framework consists of three modular components – <strong>TIDAL</strong> (dual-axis interpolation), <strong>CompDemo</strong> (complementary mask-split teacher inference), and <strong>Reverse CALM</strong> (cross-tokenizer chunk-level matching) – and is evaluated across two heterogeneous distillation pipelines.</p>
-
- <h2>✨ Highlights</h2>
- <ul>
- <li><strong>+1.53 average gain</strong> over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).</li>
- <li><strong>+16.48 on HumanEval</strong> over the equivalent-size AR baseline (48.78 vs. 32.30); distilled dLLMs especially excel at code generation.</li>
- <li><strong>22× peak-memory reduction</strong> vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and <strong>5.2× faster inference</strong> (6.25 s vs. 32.55 s for 256 tokens on H100).</li>
- </ul>
-
- <h2>🤖 Released models</h2>
-
- <p>Six 0.6B distilled student checkpoints (3 per pipeline). Each is initialized from <a href="https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"><code>dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1</code></a> and distilled from a larger dLLM teacher.</p>
-
- <table>
- <thead>
- <tr><th>Pipeline</th><th>Variant</th><th>Repo</th></tr>
- </thead>
- <tbody>
- <tr><td>A – Cross-Tokenizer (LLaDA2 teacher)</td><td><strong>TIDE-Cross</strong> (native, paper-best)</td><td><a href="https://huggingface.co/TIDE-dllm/distill-LLaDA2-TIDE_Cross">distill-LLaDA2-TIDE_Cross</a></td></tr>
- <tr><td>A – Cross-Tokenizer (LLaDA2 teacher)</td><td>TIDE-Shared variant</td><td><a href="https://huggingface.co/TIDE-dllm/distill-LLaDA2-TIDE_Shared">distill-LLaDA2-TIDE_Shared</a></td></tr>
- <tr><td>A – Cross-Tokenizer (LLaDA2 teacher)</td><td>CALM baseline</td><td><a href="https://huggingface.co/TIDE-dllm/distill-LLaDA2-CALM">distill-LLaDA2-CALM</a></td></tr>
- <tr><td>B – Shared-Tokenizer (WeDLM teacher)</td><td><strong>TIDE-Shared</strong> (native, paper-best)</td><td><a href="https://huggingface.co/TIDE-dllm/distill-WeDLM-TIDE_Shared">distill-WeDLM-TIDE_Shared</a></td></tr>
- <tr><td>B – Shared-Tokenizer (WeDLM teacher)</td><td>TIDE-Cross variant</td><td><a href="https://huggingface.co/TIDE-dllm/distill-WeDLM-TIDE_Cross">distill-WeDLM-TIDE_Cross</a></td></tr>
- <tr><td>B – Shared-Tokenizer (WeDLM teacher)</td><td>KL baseline</td><td><a href="https://huggingface.co/TIDE-dllm/distill-WeDLM-KL">distill-WeDLM-KL</a></td></tr>
- </tbody>
- </table>
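-
- <p>For scripted use, the table above can be mirrored as a plain mapping from (pipeline, variant) to repo ID; the entries below simply restate the table and feed directly into the loader shown in the Quick start section:</p>
-
- <pre><code># Repo IDs of the six released 0.6B students, copied from the table above.
- TIDE_CHECKPOINTS = {
-     ("A", "TIDE-Cross"):  "TIDE-dllm/distill-LLaDA2-TIDE_Cross",   # Pipeline-A paper-best
-     ("A", "TIDE-Shared"): "TIDE-dllm/distill-LLaDA2-TIDE_Shared",
-     ("A", "CALM"):        "TIDE-dllm/distill-LLaDA2-CALM",
-     ("B", "TIDE-Shared"): "TIDE-dllm/distill-WeDLM-TIDE_Shared",   # Pipeline-B paper-best
-     ("B", "TIDE-Cross"):  "TIDE-dllm/distill-WeDLM-TIDE_Cross",
-     ("B", "KL"):          "TIDE-dllm/distill-WeDLM-KL",
- }
- </code></pre>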
-
- <h2>📚 Released datasets</h2>
-
- <p>Pre-tokenized SFT mixtures (<code>tulu-3-sft-mixture</code> + <code>smoltalk</code> + <code>opc-sft-stage1</code> + <code>opc-sft-stage2</code>) prepared for each teacher, so distillation jobs never re-tokenize at startup.</p>
-
- <table>
- <thead>
- <tr><th>Pipeline</th><th>Repo</th></tr>
- </thead>
- <tbody>
- <tr><td>A – for the LLaDA2 teacher</td><td><a href="https://huggingface.co/datasets/TIDE-dllm/distill_llada2_sft">distill_llada2_sft</a></td></tr>
- <tr><td>B – for the WeDLM teacher</td><td><a href="https://huggingface.co/datasets/TIDE-dllm/distill_wedlm_sft">distill_wedlm_sft</a></td></tr>
- </tbody>
- </table>
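-
- <p>A minimal loading sketch for the Pipeline-A mixture (the split names and pre-tokenized column layout are not documented on this card, so inspect the returned object; swap in <code>distill_wedlm_sft</code> for Pipeline B):</p>
-
- <pre><code>from datasets import load_dataset
-
- # Pre-tokenized SFT mixture prepared for the LLaDA2 teacher (Pipeline A).
- ds = load_dataset("TIDE-dllm/distill_llada2_sft")
- print(ds)  # available splits and pre-tokenized columns follow the released schema
- </code></pre>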
-
- <h2>🚀 Quick start</h2>
-
- <pre><code>import torch
- from transformers import AutoModelForMaskedLM, AutoTokenizer
-
- repo = "TIDE-dllm/distill-LLaDA2-TIDE_Cross"  # paper-best Pipeline-A checkpoint
- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- model = AutoModelForMaskedLM.from_pretrained(
-     repo, dtype=torch.bfloat16, trust_remote_code=True,
- ).to(device).eval()
- tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
- </code></pre>
-
- <p>The same <code>generate()</code> routine published with <a href="https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"><code>dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1</code></a> works on every TIDE checkpoint; just swap the model name.</p>
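-
- <p>For example, a minimal smoke test after such a swap. This is only an assumption about the checkpoint's interface, not the published <code>generate()</code> routine: it presumes the remote-code model accepts standard tokenizer inputs and returns masked-LM logits.</p>
-
- <pre><code>import torch
- from transformers import AutoModelForMaskedLM, AutoTokenizer
-
- # Same loader pattern as above, pointed at the paper-best Pipeline-B checkpoint.
- repo = "TIDE-dllm/distill-WeDLM-TIDE_Shared"
- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- model = AutoModelForMaskedLM.from_pretrained(
-     repo, dtype=torch.bfloat16, trust_remote_code=True,
- ).to(device).eval()
- tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
-
- # Single forward pass; expected logits shape is (batch, seq_len, vocab_size).
- inputs = tokenizer("Turning the TIDE", return_tensors="pt").to(device)
- with torch.no_grad():
-     logits = model(**inputs).logits
- print(logits.shape)
- </code></pre>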
-
- <h2>📝 Citation</h2>
-
- <pre><code>@misc{zhang2026turningtidecrossarchitecturedistillation,
-   title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
-   author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
-   year={2026},
-   eprint={2604.26951},
-   archivePrefix={arXiv},
-   primaryClass={cs.CL},
-   url={https://arxiv.org/abs/2604.26951},
- }
- </code></pre>