tritesh committed on
Commit 0b3ae42 · verified · 1 Parent(s): d728bf2

Upload README.md

Files changed (1)
  1. README.md +64 -10
README.md CHANGED
@@ -1,15 +1,69 @@
  ---
- sdk: gradio
- title: Dflash Mlx Universal Demo
- emoji: 📊
  colorFrom: purple
- colorTo: purple
- sdk_version: 6.14.0
- python_version: '3.13'
  app_file: app.py
- pinned: false
- tags:
- - ml-intern
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: "DFlash-MLX-Universal: Interactive Demo"
+ emoji: 🚀
  colorFrom: purple
+ colorTo: blue
+ sdk: gradio
+ sdk_version: "5.0.0"
  app_file: app.py
+ pinned: true
  ---
 
+ # 🚀 DFlash-MLX-Universal Demo
+
+ **Block Diffusion Speculative Decoding for Apple Silicon (MLX)**
+
+ This interactive demo showcases [DFlash](https://arxiv.org/abs/2602.06036), a block diffusion model that accelerates LLM inference by **6×** on Apple Silicon with **lossless output**.
+
+ ## What is DFlash?
+
+ - **Traditional speculative decoding**: Drafts 1 token at a time → 2-3× speedup
+ - **DFlash**: Drafts 16 tokens in parallel via diffusion → **6× speedup**
+ - **Key innovation**: Draft model conditions on target model's hidden states (KV injection)
+ - **Result**: Output identical to greedy autoregressive generation
+
+ ## Demo Tabs
+
+ | Tab | What it does |
+ |-----|-------------|
+ | 🏃 **Quick Start** | Select a model, enter a prompt, generate code & see simulated results |
+ | 🛠️ **Convert Drafter** | Get the `uv` command to convert official drafters to MLX format |
+ | 🎓 **Training** | Code template to train custom drafters for unsupported models |
+ | 🖥️ **Server** | Commands to start an OpenAI-compatible local server |
+ | 📊 **Benchmarks** | Performance table: 6× speedup across 6 models |
+ | 📖 **Architecture** | Deep dive into how block diffusion + KV injection works |
+ | 📦 **Installation** | `uv` and `pip` setup instructions |
+
+ ## Supported Models
+
+ - **Qwen3** (4B, 8B)
+ - **Qwen3.5** (4B, 9B, 27B)
+ - **Qwen3.6** (27B, 35B-A3B)
+ - **LLaMA-3.1** (8B)
+ - **Gemma-4** (31B)
+
+ ## Quick Start (on your Mac)
+
+ ```bash
+ # 1. Install uv
+ brew install uv
+
+ # 2. Clone and setup
+ git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
+ cd dflash-mlx-universal
+ ./setup_uv.sh
+
+ # 3. Convert a drafter
+ uv run python -m dflash_mlx.convert \
+   --model z-lab/Qwen3-4B-DFlash-b16 \
+   --output ./Qwen3-4B-DFlash-mlx
+
+ # 4. Generate
+ uv run python examples/qwen3_4b_demo.py
+ ```
+
+ ## Links
+
+ - **Paper**: [arXiv:2602.06036](https://arxiv.org/abs/2602.06036)
+ - **Repository**: [tritesh/dflash-mlx-universal](https://huggingface.co/tritesh/dflash-mlx-universal)
+ - **Package**: `dflash-mlx-universal` (PyPI compatible)
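
Note on the "Output identical to greedy autoregressive generation" line in the README above: the sketch below shows the standard greedy verification rule used in speculative decoding, which is one way such a guarantee can hold. It is not taken from the `dflash_mlx` package. The names `toy_target`, `toy_draft_block`, and `speculative_decode` are hypothetical, the "models" are plain Python functions over token ids, and the target is called once per position for clarity, whereas a real system would score the whole drafted block in a single batched forward pass (which is where the speedup comes from).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a "model" here is just a function over token ids.
TargetFn = Callable[[Sequence[int]], int]            # greedy next token for a context
DraftFn = Callable[[Sequence[int], int], List[int]]  # proposes k tokens in one call


def greedy_decode(target: TargetFn, prompt: Sequence[int], n_new: int) -> List[int]:
    """Baseline: plain greedy autoregressive decoding, one target call per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(
    target: TargetFn,
    draft_block: DraftFn,
    prompt: Sequence[int],
    n_new: int,
    block_size: int = 16,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding with block drafts.

    Every emitted token is the target's own greedy choice, so the output
    matches greedy_decode by construction. The drafter only decides how many
    tokens are confirmed per verification round.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        rounds += 1
        proposal = list(draft_block(out, block_size))
        if not proposal:
            # Nothing drafted: fall back to a single ordinary greedy step.
            out.append(target(out))
            continue
        for tok in proposal:
            expected = target(out)  # the token the target itself would emit here
            out.append(expected)    # always keep the target's token
            if tok != expected or len(out) - len(prompt) >= n_new:
                break               # first mismatch ends the round (or we are done)
    return out[len(prompt):], rounds


def toy_target(ctx: Sequence[int]) -> int:
    """Toy deterministic 'target model' (assumes a non-empty context)."""
    return (ctx[-1] * 7 + 3) % 101


def toy_draft_block(ctx: Sequence[int], k: int) -> List[int]:
    """Toy drafter: agrees with the target except at every 5th context length."""
    block: List[int] = []
    cur = list(ctx)
    for _ in range(k):
        nxt = toy_target(cur)
        if len(cur) % 5 == 0:
            nxt = (nxt + 1) % 101  # deliberately wrong guess
        block.append(nxt)
        cur.append(nxt)
    return block


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft_block, [1], 40)
    assert tokens == greedy_decode(toy_target, [1], 40)
    print(f"{len(tokens)} tokens confirmed in {rounds} verification rounds")
```

With a drafter that is usually right, the number of verification rounds is much smaller than the number of generated tokens, while the emitted sequence never depends on the drafter's mistakes.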