tritesh's picture
Update ML Intern artifact metadata
5c66bee verified

A newer version of the Gradio SDK is available: 6.14.0

Upgrade
metadata
title: 'DFlash-MLX-Universal: Interactive Demo'
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.0.0
app_file: app.py
pinned: true
tags:
  - ml-intern

πŸš€ DFlash-MLX-Universal Demo

Block Diffusion Speculative Decoding for Apple Silicon (MLX)

This interactive demo showcases DFlash β€” a block diffusion model that accelerates LLM inference by 6Γ— on Apple Silicon with lossless output.

What is DFlash?

  • Traditional speculative decoding: Drafts 1 token at a time β†’ 2-3Γ— speedup
  • DFlash: Drafts 16 tokens in parallel via diffusion β†’ 6Γ— speedup
  • Key innovation: Draft model conditions on target model's hidden states (KV injection)
  • Result: Output identical to greedy autoregressive generation

Demo Tabs

Tab What it does
πŸƒ Quick Start Select a model, enter a prompt, generate code & see simulated results
πŸ› οΈ Convert Drafter Get the uv command to convert official drafters to MLX format
πŸŽ“ Training Code template to train custom drafters for unsupported models
πŸ–₯️ Server Commands to start an OpenAI-compatible local server
πŸ“Š Benchmarks Performance table: 6Γ— speedup across 6 models
πŸ“– Architecture Deep dive into how block diffusion + KV injection works
πŸ“¦ Installation uv and pip setup instructions

Supported Models

  • Qwen3 (4B, 8B)
  • Qwen3.5 (4B, 9B, 27B)
  • Qwen3.6 (27B, 35B-A3B)
  • LLaMA-3.1 (8B)
  • Gemma-4 (31B)

Quick Start (on your Mac)

# 1. Install uv
brew install uv

# 2. Clone and setup
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
./setup_uv.sh

# 3. Convert a drafter
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ./Qwen3-4B-DFlash-mlx

# 4. Generate
uv run python examples/qwen3_4b_demo.py

Links