Spaces:
Runtime error
Runtime error
Upload README.md
Browse files
README.md
CHANGED
|
@@ -1,15 +1,69 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
emoji: π
|
| 5 |
colorFrom: purple
|
| 6 |
-
colorTo:
|
| 7 |
-
|
| 8 |
-
|
| 9 |
app_file: app.py
|
| 10 |
-
pinned:
|
| 11 |
-
tags:
|
| 12 |
-
- ml-intern
|
| 13 |
---
|
| 14 |
|
| 15 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: "DFlash-MLX-Universal: Interactive Demo"
|
| 3 |
+
emoji: π
|
|
|
|
| 4 |
colorFrom: purple
|
| 5 |
+
colorTo: blue
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: "5.0.0"
|
| 8 |
app_file: app.py
|
| 9 |
+
pinned: true
|
|
|
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# π DFlash-MLX-Universal Demo
|
| 13 |
+
|
| 14 |
+
**Block Diffusion Speculative Decoding for Apple Silicon (MLX)**
|
| 15 |
+
|
| 16 |
+
This interactive demo showcases [DFlash](https://arxiv.org/abs/2602.06036) β a block diffusion model that accelerates LLM inference by **6Γ** on Apple Silicon with **lossless output**.
|
| 17 |
+
|
| 18 |
+
## What is DFlash?
|
| 19 |
+
|
| 20 |
+
- **Traditional speculative decoding**: Drafts 1 token at a time β 2-3Γ speedup
|
| 21 |
+
- **DFlash**: Drafts 16 tokens in parallel via diffusion β **6Γ speedup**
|
| 22 |
+
- **Key innovation**: Draft model conditions on target model's hidden states (KV injection)
|
| 23 |
+
- **Result**: Output identical to greedy autoregressive generation
|
| 24 |
+
|
| 25 |
+
## Demo Tabs
|
| 26 |
+
|
| 27 |
+
| Tab | What it does |
|
| 28 |
+
|-----|-------------|
|
| 29 |
+
| π **Quick Start** | Select a model, enter a prompt, generate code & see simulated results |
|
| 30 |
+
| π οΈ **Convert Drafter** | Get the `uv` command to convert official drafters to MLX format |
|
| 31 |
+
| π **Training** | Code template to train custom drafters for unsupported models |
|
| 32 |
+
| π₯οΈ **Server** | Commands to start an OpenAI-compatible local server |
|
| 33 |
+
| π **Benchmarks** | Performance table: 6Γ speedup across 6 models |
|
| 34 |
+
| π **Architecture** | Deep dive into how block diffusion + KV injection works |
|
| 35 |
+
| π¦ **Installation** | `uv` and `pip` setup instructions |
|
| 36 |
+
|
| 37 |
+
## Supported Models
|
| 38 |
+
|
| 39 |
+
- **Qwen3** (4B, 8B)
|
| 40 |
+
- **Qwen3.5** (4B, 9B, 27B)
|
| 41 |
+
- **Qwen3.6** (27B, 35B-A3B)
|
| 42 |
+
- **LLaMA-3.1** (8B)
|
| 43 |
+
- **Gemma-4** (31B)
|
| 44 |
+
|
| 45 |
+
## Quick Start (on your Mac)
|
| 46 |
+
|
| 47 |
+
```bash
|
| 48 |
+
# 1. Install uv
|
| 49 |
+
brew install uv
|
| 50 |
+
|
| 51 |
+
# 2. Clone and setup
|
| 52 |
+
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
|
| 53 |
+
cd dflash-mlx-universal
|
| 54 |
+
./setup_uv.sh
|
| 55 |
+
|
| 56 |
+
# 3. Convert a drafter
|
| 57 |
+
uv run python -m dflash_mlx.convert \
|
| 58 |
+
--model z-lab/Qwen3-4B-DFlash-b16 \
|
| 59 |
+
--output ./Qwen3-4B-DFlash-mlx
|
| 60 |
+
|
| 61 |
+
# 4. Generate
|
| 62 |
+
uv run python examples/qwen3_4b_demo.py
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
## Links
|
| 66 |
+
|
| 67 |
+
- **Paper**: [arXiv:2602.06036](https://arxiv.org/abs/2602.06036)
|
| 68 |
+
- **Repository**: [tritesh/dflash-mlx-universal](https://huggingface.co/tritesh/dflash-mlx-universal)
|
| 69 |
+
- **Package**: `dflash-mlx-universal` (PyPI compatible)
|