Spaces:
Runtime error
Runtime error
A newer version of the Gradio SDK is available: 6.14.0
metadata
title: 'DFlash-MLX-Universal: Interactive Demo'
emoji: π
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.0.0
app_file: app.py
pinned: true
tags:
- ml-intern
π DFlash-MLX-Universal Demo
Block Diffusion Speculative Decoding for Apple Silicon (MLX)
This interactive demo showcases DFlash β a block diffusion model that accelerates LLM inference by 6Γ on Apple Silicon with lossless output.
What is DFlash?
- Traditional speculative decoding: Drafts 1 token at a time β 2-3Γ speedup
- DFlash: Drafts 16 tokens in parallel via diffusion β 6Γ speedup
- Key innovation: Draft model conditions on target model's hidden states (KV injection)
- Result: Output identical to greedy autoregressive generation
Demo Tabs
| Tab | What it does |
|---|---|
| π Quick Start | Select a model, enter a prompt, generate code & see simulated results |
| π οΈ Convert Drafter | Get the uv command to convert official drafters to MLX format |
| π Training | Code template to train custom drafters for unsupported models |
| π₯οΈ Server | Commands to start an OpenAI-compatible local server |
| π Benchmarks | Performance table: 6Γ speedup across 6 models |
| π Architecture | Deep dive into how block diffusion + KV injection works |
| π¦ Installation | uv and pip setup instructions |
Supported Models
- Qwen3 (4B, 8B)
- Qwen3.5 (4B, 9B, 27B)
- Qwen3.6 (27B, 35B-A3B)
- LLaMA-3.1 (8B)
- Gemma-4 (31B)
Quick Start (on your Mac)
# 1. Install uv
brew install uv
# 2. Clone and setup
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
./setup_uv.sh
# 3. Convert a drafter
uv run python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ./Qwen3-4B-DFlash-mlx
# 4. Generate
uv run python examples/qwen3_4b_demo.py
Links
- Paper: arXiv:2602.06036
- Repository: tritesh/dflash-mlx-universal
- Package:
dflash-mlx-universal(PyPI compatible)