File size: 2,214 Bytes
edca3c2
5c66bee
0b3ae42
edca3c2
0b3ae42
 
5c66bee
edca3c2
0b3ae42
5c66bee
 
edca3c2
 
0b3ae42
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
---
title: 'DFlash-MLX-Universal: Interactive Demo'
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.0.0
app_file: app.py
pinned: true
tags:
- ml-intern
---

# πŸš€ DFlash-MLX-Universal Demo

**Block Diffusion Speculative Decoding for Apple Silicon (MLX)**

This interactive demo showcases [DFlash](https://arxiv.org/abs/2602.06036) β€” a block diffusion model that accelerates LLM inference by **6Γ—** on Apple Silicon with **lossless output**.

## What is DFlash?

- **Traditional speculative decoding**: Drafts 1 token at a time β†’ 2-3Γ— speedup
- **DFlash**: Drafts 16 tokens in parallel via diffusion β†’ **6Γ— speedup**
- **Key innovation**: Draft model conditions on target model's hidden states (KV injection)
- **Result**: Output identical to greedy autoregressive generation

## Demo Tabs

| Tab | What it does |
|-----|-------------|
| πŸƒ **Quick Start** | Select a model, enter a prompt, generate code & see simulated results |
| πŸ› οΈ **Convert Drafter** | Get the `uv` command to convert official drafters to MLX format |
| πŸŽ“ **Training** | Code template to train custom drafters for unsupported models |
| πŸ–₯️ **Server** | Commands to start an OpenAI-compatible local server |
| πŸ“Š **Benchmarks** | Performance table: 6Γ— speedup across 6 models |
| πŸ“– **Architecture** | Deep dive into how block diffusion + KV injection works |
| πŸ“¦ **Installation** | `uv` and `pip` setup instructions |

## Supported Models

- **Qwen3** (4B, 8B)
- **Qwen3.5** (4B, 9B, 27B)
- **Qwen3.6** (27B, 35B-A3B)
- **LLaMA-3.1** (8B)
- **Gemma-4** (31B)

## Quick Start (on your Mac)

```bash
# 1. Install uv
brew install uv

# 2. Clone and setup
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
./setup_uv.sh

# 3. Convert a drafter
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ./Qwen3-4B-DFlash-mlx

# 4. Generate
uv run python examples/qwen3_4b_demo.py
```

## Links

- **Paper**: [arXiv:2602.06036](https://arxiv.org/abs/2602.06036)
- **Repository**: [tritesh/dflash-mlx-universal](https://huggingface.co/tritesh/dflash-mlx-universal)
- **Package**: `dflash-mlx-universal` (PyPI compatible)