---
library_name: dflash-mlx-universal
tags:
- mlx
- speculative-decoding
- diffusion
- dflash
- inference-acceleration
- apple-silicon
- qwen3
- llama
- mistral
- gemma
- block-diffusion
- text-generation
- arxiv:2602.06036
license: mit
---
# DFlash-MLX-Universal: Block Diffusion Speculative Decoding for MLX
> **Universal** DFlash speculative decoding implementation for Apple Silicon (MLX).
> Works with **any MLX-converted model**: Qwen3, Qwen3.5, LLaMA, Mistral, Gemma, and more.
[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://python.org)
[![MLX](https://img.shields.io/badge/MLX-latest-red)](https://github.com/ml-explore/mlx)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![uv](https://img.shields.io/badge/uv-astral-purple)](https://github.com/astral-sh/uv)
---
## 🚀 What is DFlash?
[DFlash](https://arxiv.org/abs/2602.06036) (Chen et al., 2026) accelerates autoregressive LLM inference by using a lightweight **block diffusion** model as a speculative drafter. Unlike traditional autoregressive drafters, DFlash generates multiple draft tokens **in parallel** within each block, achieving a **4-6× lossless speedup** over baseline inference.
**Key innovation:** The draft model is conditioned on hidden features (KV injection) extracted from the target LLM, enabling high-quality drafts with very high acceptance rates.
| Feature | Baseline | DFlash | Improvement |
|---------|----------|--------|-------------|
| **Speed** | ~20 tok/s | ~120 tok/s | **6× faster** |
| **Quality** | Same | Same | **Lossless** |
| **Acceptance** | N/A | τ ≈ 6-7 | **~6 tokens accepted per draft** |
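At a high level, the decode loop alternates one parallel draft pass with one target verification pass. A toy sketch of that outer loop (plain callables stand in for the drafter and the target; this is not the package API):

```python
def speculative_decode(prompt_ids, draft_block, verify, block_size=16, max_new=256):
    """Toy sketch of the DFlash outer loop (not the shipped API).

    draft_block(ids, k) -> list[int]         : drafter proposes k tokens in one parallel pass
    verify(ids, draft) -> (accepted, bonus)  : target checks the draft in one forward pass
    """
    ids = list(prompt_ids)
    stop_at = len(ids) + max_new
    while len(ids) < stop_at:
        draft = draft_block(ids, block_size)   # one block-diffusion pass
        accepted, bonus = verify(ids, draft)   # one target forward pass over the block
        ids += accepted + [bonus]              # always advances by at least the bonus token
    return ids
```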
---
## ✨ What's New in Universal (v0.2.0)
This is a **major rewrite** that fixes the critical gaps in earlier community ports:
| Gap | Before (v0.1.x) | **Now (v0.2.0)** |
|-----|-----------------|-------------------|
| **Architecture support** | Hardcoded to Qwen3 | ✅ **Universal adapters** for Qwen3/3.5, LLaMA, Mistral, Gemma |
| **Hidden state extraction** | Direct `.layers` access (breaks on most models) | ✅ **Architecture-aware adapter system** with per-family hooks |
| **KV cache management** | None (never rewound) | ✅ **Proper trim/rewind** on draft rejection |
| **Attention masks** | `mask=None` (undefined behavior) | ✅ **Family-specific mask generation** |
| **Token acceptance** | Buggy `cumprod` logic | ✅ **First-mismatch detection** with bonus token (see the sketch below) |
| **Streaming** | Not supported | ✅ **Real-time text streaming** with generator interface |
| **OpenAI server** | Not supported | ✅ **FastAPI + simple HTTP** with metrics endpoint |
| **Model conversion** | PyTorch→MLX weight converter | ✅ **Updated for all z-lab drafters** |
| **Training** | Basic trainer | ✅ **Architecture-aware training** with adapter compatibility |
| **Benchmarking** | None | ✅ **Built-in benchmark** vs mlx_lm baseline |
| **uv support** | pip only | ✅ **`uv` + `uv run` workflow** with lock files |
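The corrected acceptance rule from the table above reduces to a first-mismatch scan plus one guaranteed token from the target. An illustrative sketch under greedy decoding (variable names are ours, not the package's):

```python
def accept_block(draft_tokens, target_tokens, next_token):
    """First-mismatch acceptance with a bonus token (greedy verification sketch).

    draft_tokens  : tokens the drafter proposed for one block
    target_tokens : the target model's greedy pick at each of those positions
    next_token    : the target's prediction for the position right after the block
    """
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    # If the draft diverged, the target's own token at the mismatch becomes the
    # bonus; otherwise the bonus is the target's prediction after the block.
    # Either way the verification pass emits at least one new token.
    bonus = target_tokens[n] if n < len(draft_tokens) else next_token
    return draft_tokens[:n] + [bonus]
```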
---
## 📦 Installation
### Option 1: `uv` (Recommended: ultra-fast, reproducible)
[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the **recommended** way to install on macOS.
```bash
# 1. Install uv (one-time)
brew install uv
# or: curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and setup
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
# 3. One-command setup (creates venv, installs deps, locks)
chmod +x setup_uv.sh
./setup_uv.sh
# Or manually:
uv venv
uv pip install -e ".[dev,server]"
uv lock
```
**Why `uv`?**
- 10-100× faster than `pip` (written in Rust)
- Automatic virtual environment management
- Lock file (`uv.lock`) for reproducible installs
- `uv run` runs any script without manually activating the venv
```bash
# Examples of uv workflow
uv run python examples/qwen3_4b_demo.py
uv run pytest tests/ -v
uv run python -m dflash_mlx.serve --target ... --draft ... --port 8000
uv run black dflash_mlx/
uv run ruff check dflash_mlx/
```
### Option 2: pip (Classic)
```bash
pip install mlx-lm dflash-mlx-universal
```
For Apple Silicon (M1/M2/M3/M4):
```bash
pip install --upgrade pip
pip install mlx-lm dflash-mlx-universal
```
**Optional** (for server mode):
```bash
pip install fastapi uvicorn
```
---
## ⚡ Quick Start
### Option 1: Pre-converted DFlash drafter (recommended)
```python
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash, infer_target_model
from mlx_lm import load
# 1. Load any MLX target model
target_path = "mlx-community/Qwen3-4B-bf16"
model, tokenizer = load(target_path)
# 2. Load a pre-converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("./Qwen3-4B-DFlash-mlx")
# 3. Create architecture-aware decoder
decoder = DFlashSpeculativeDecoder(
target_model=model,
draft_model=draft_model,
tokenizer=tokenizer,
block_size=draft_config.get("block_size", 16),
)
# 4. Generate with 6× speedup
output = decoder.generate(
prompt="Explain quantum computing to a 10-year-old.",
max_tokens=1024,
temperature=0.0,
)
print(output)
```
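Streaming is listed among the v0.2.0 features; the snippet below is a sketch of a generator-style call, and the `stream=True` flag is an assumption on our part. Check `dflash_mlx/speculative_decode.py` for the exact streaming entry point in your version.

```python
# Assumed streaming interface: yields text chunks as blocks are accepted.
# The stream flag is illustrative; verify the name in your installed version.
for chunk in decoder.generate(
    prompt="Summarise the plot of Hamlet in three sentences.",
    max_tokens=256,
    temperature=0.0,
    stream=True,  # assumption: the README only states a generator interface exists
):
    print(chunk, end="", flush=True)
```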
### Option 2: Universal decoder (auto-detects architecture)
```python
from dflash_mlx.universal import UniversalDFlashDecoder
from mlx_lm import load
# Works with ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
# Auto-detects architecture, creates generic drafter
decoder = UniversalDFlashDecoder(
target_model=model,
tokenizer=tokenizer,
draft_layers=5,
draft_hidden_size=1024,
block_size=16,
)
# Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
dataset="open-web-math",
epochs=6,
lr=6e-4,
batch_size=16,
)
output = decoder.generate("Write a Python function to implement quicksort.")
print(output)
```
### Option 3: Convert PyTorch drafter to MLX
```bash
# Download official z-lab drafter and convert weights
python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ./Qwen3-4B-DFlash-mlx
# Or with uv (recommended)
uv run python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ./Qwen3-4B-DFlash-mlx
```

Or in Python:

```python
from dflash_mlx.convert import convert_dflash_to_mlx

convert_dflash_to_mlx(
    pytorch_model_id="z-lab/Qwen3.5-9B-DFlash",
    output_path="./Qwen3.5-9B-DFlash-mlx",
)
```
---
## 🎯 Supported Models
### Pre-built DFlash drafters (convert to MLX)
All official `z-lab/*-DFlash` models can be converted:
| PyTorch Drafter | Target Model | Status |
|----------------|-------------|--------|
| `z-lab/Qwen3-4B-DFlash-b16` | `Qwen/Qwen3-4B` | ✅ Ready |
| `z-lab/Qwen3-8B-DFlash-b16` | `Qwen/Qwen3-8B` | ✅ Ready |
| `z-lab/Qwen3.5-4B-DFlash` | `Qwen/Qwen3.5-4B` | ✅ Ready |
| `z-lab/Qwen3.5-9B-DFlash` | `Qwen/Qwen3.5-9B` | ✅ Ready |
| `z-lab/Qwen3.5-27B-DFlash` | `Qwen/Qwen3.5-27B` | ✅ Ready |
| `z-lab/Qwen3.6-27B-DFlash` | `Qwen/Qwen3.6-27B` | ✅ Ready |
| `z-lab/Qwen3.6-35B-A3B-DFlash` | `Qwen/Qwen3.6-35B-A3B` | ✅ Ready |
| `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat` | `meta-llama/Llama-3.1-8B` | ✅ Ready |
| `z-lab/gemma-4-31B-it-DFlash` | `google/gemma-4-31b-it` | ✅ Ready |
| `z-lab/gpt-oss-20b-DFlash` | `openai/gpt-oss-20b` | ✅ Ready |
| `z-lab/Kimi-K2.5-DFlash` | `moonshotai/Kimi-K2.5` | ✅ Ready |
### Architecture adapters (built-in)
| Model Family | Adapter | Hidden States | KV Cache | Attention Mask |
|-------------|---------|---------------|----------|----------------|
| **Qwen3** | `Qwen3Adapter` | ✅ | ✅ `KVCache.trim()` | ✅ `qwen3.create_attention_mask` |
| **Qwen3.5** | `Qwen35Adapter` | ✅ | ✅ ArraysCache | ✅ Hybrid FA + SSM masks |
| **LLaMA 2/3** | `LlamaAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `llama.create_attention_mask` |
| **Mistral** | `MistralAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `mistral.create_attention_mask` |
| **Gemma** | `GemmaAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `gemma.create_attention_mask` |
| **Generic** | `MLXTargetAdapter` | ✅ | ✅ Basic trim | ⚠️ Causal fallback |
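A minimal sketch of resolving one of the adapters above for a loaded model. The `adapter_for_model_type` helper is the one shown in "Adding a New Model Family" below; the single-argument constructor and the `model.args.model_type` attribute are assumptions based on common mlx_lm conventions.

```python
from mlx_lm import load
from dflash_mlx.adapters import adapter_for_model_type  # module path per the repo layout below

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# mlx_lm models usually expose their family via model.args.model_type (e.g. "mistral").
adapter_cls = adapter_for_model_type(model.args.model_type)
adapter = adapter_cls(model)  # assumed constructor: wraps the target model

# The adapter now handles hidden-state extraction, attention masks, and KV trimming
# for this family, so the decoder never reaches into model.layers directly.
```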
---
## 🏗️ Architecture Overview
```
┌───────────────────┐     ┌───────────────────┐
│   Target Model    │────▶│  Extract Hidden   │
│   (Any MLX LLM)   │     │   Features (KV)   │
└───────────────────┘     └─────────┬─────────┘
                                    │
                                    ▼
┌───────────────────┐     ┌───────────────────┐
│   Verify Drafts   │◀────│   DFlash Draft    │
│    (Parallel)     │     │ Model (Diffusion) │
└───────────────────┘     └───────────────────┘
          │                         ▲
          │     Accepted Tokens     │
          └─────────────────────────┘
```
### Key Design
1. **Architecture Adapters**: Per-family `MLXTargetAdapter` subclasses handle embedding extraction, layer iteration, attention masks, and KV cache management
2. **KV Injection**: Target model hidden states → draft model's K/V projections via `extract_context_features()`
3. **Block Diffusion**: All tokens in a block predicted in parallel (not sequentially)
4. **Cross-Layer Fusion**: Features from multiple target layers concatenated and projected (see the sketch after this list)
5. **Exact Acceptance**: Draft tokens verified greedily; KV cache rewound to accepted prefix
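Item 4 above is easy to picture as a single projection over concatenated layer outputs. A minimal sketch (layer count and widths are illustrative, not the shipped defaults):

```python
import mlx.core as mx
import mlx.nn as nn

class CrossLayerFusion(nn.Module):
    """Concatenate hidden states tapped from several target layers and project
    them to the drafter's width (illustrative dimensions, not the defaults)."""

    def __init__(self, target_hidden: int, draft_hidden: int, num_layers: int):
        super().__init__()
        self.proj = nn.Linear(target_hidden * num_layers, draft_hidden)

    def __call__(self, layer_hiddens):
        # layer_hiddens: list of (batch, seq_len, target_hidden) arrays, one per tapped layer
        fused = mx.concatenate(layer_hiddens, axis=-1)
        return self.proj(fused)  # (batch, seq_len, draft_hidden), consumed by the drafter's K/V

# Shapes only: three tapped layers of width 2560 fused down to a 1024-wide drafter
fusion = CrossLayerFusion(target_hidden=2560, draft_hidden=1024, num_layers=3)
features = fusion([mx.zeros((1, 8, 2560)) for _ in range(3)])
```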
---
## 📊 Benchmarking
```python
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
from mlx_lm import load
model, tokenizer = load("Qwen/Qwen3-4B")
draft_model, _ = load_mlx_dflash("./Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
# Built-in benchmark (runs warmup + multiple trials)
results = decoder.benchmark(
prompt="Write a quicksort in Python.",
max_tokens=512,
num_runs=5,
)
# prints: Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```
---
## 🖥️ OpenAI-Compatible Server
```bash
# Start server with DFlash acceleration
python -m dflash_mlx.serve \
--target mlx-community/Qwen3.5-9B-4bit \
--draft ./Qwen3.5-9B-DFlash-mlx \
--block-size 16 \
--port 8000
# With uv (recommended)
uv run python -m dflash_mlx.serve \
--target mlx-community/Qwen3.5-9B-4bit \
--draft ./Qwen3.5-9B-DFlash-mlx \
--block-size 16 \
--port 8000
# Query with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-9b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256,
"temperature": 0.0,
"stream": false
}'
# Streaming SSE
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-9b",
"messages": [{"role": "user", "content": "Count to 10"}],
"max_tokens": 100,
"stream": true
}'
# Check metrics
curl http://localhost:8000/metrics
```
**Endpoints:**
- `GET /health`: Server status and mode
- `GET /v1/models`: Available models
- `GET /metrics`: Request count, tok/s, recent history
- `POST /v1/chat/completions`: Chat completions (OpenAI-compatible; Python client example below)
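The same endpoint can be driven from the official OpenAI Python client; the model name should match whatever the server reports under `/v1/models`, and local servers typically ignore the API key:

```python
from openai import OpenAI

# Point the client at the local DFlash server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.5-9b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```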
---
## 🏋️ Training Custom Drafters
```python
from dflash_mlx.universal import UniversalDFlashDecoder
from mlx_lm import load
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
decoder = UniversalDFlashDecoder(
target_model=model,
tokenizer=tokenizer,
draft_layers=5,
draft_hidden_size=1024,
)
# Train using paper recipe (6 epochs, lr=6e-4, AdamW)
decoder.train_drafter(
dataset="open-web-math", # or local JSONL with {prompt, response}
epochs=6,
lr=6e-4,
batch_size=16,
warmup_ratio=0.04,
grad_clip=1.0,
output_path="./my-llama-drafter",
)
# Save the trained drafter for later use
decoder.save_drafter("./my-llama-drafter")
```
**Training recipe** (from DFlash paper §5):
- Data mix: 50% Chat + 30% Math + 20% Code
- Random anchor sampling: real accepted tokens as block starts
- Sparse attention mask: bidirectional within block, causal across blocks (sketched below)
- Position-dependent loss decay: exponential decay from anchor (sketched below)
- AdamW: lr=6e-4, 6 epochs, grad_clip=1.0, cosine schedule
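A minimal sketch of the two block-specific pieces above, assuming an additive attention mask and an illustrative decay constant (the paper's exact value may differ):

```python
import mlx.core as mx

def block_diffusion_mask(seq_len: int, block_size: int) -> mx.array:
    """Additive training mask: bidirectional inside a block, causal across blocks.
    0 where attention is allowed, -inf where it is blocked."""
    block_id = mx.arange(seq_len) // block_size
    allowed = block_id[:, None] >= block_id[None, :]  # same block or any earlier block
    zeros = mx.zeros((seq_len, seq_len))
    neg_inf = mx.full((seq_len, seq_len), float("-inf"))
    return mx.where(allowed, zeros, neg_inf)

def position_loss_weights(block_size: int, decay: float = 0.8) -> mx.array:
    """Exponential loss decay with distance from the block anchor
    (0.8 is illustrative, not the paper's constant)."""
    return mx.array([decay ** i for i in range(block_size)])
```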
---
## 📁 Repository Structure
```
dflash-mlx-universal/
├── dflash_mlx/
│   ├── __init__.py              # Package exports
│   ├── adapters.py              # 🔑 Architecture adapters (NEW v0.2.0)
│   ├── model.py                 # DFlash draft model (attention, diffusion)
│   ├── speculative_decode.py    # Core speculative decoding loop (FIXED)
│   ├── convert.py               # PyTorch → MLX weight converter
│   ├── universal.py             # Generic decoder for any model
│   ├── trainer.py               # DFlash drafter training
│   ├── data.py                  # Training data generation
│   └── serve.py                 # OpenAI-compatible HTTP server (NEW)
├── examples/
│   ├── qwen3_4b_demo.py         # End-to-end Qwen3 demo
│   ├── convert_drafter.py       # CLI conversion script
│   └── train_custom_drafter.py  # CLI training script
├── tests/
│   ├── test_model.py            # Model unit tests
│   └── test_adapters.py         # Adapter tests (NEW)
├── benchmark_m2.py              # Apple Silicon benchmark
├── setup_m2.sh                  # Automated setup script
├── setup_uv.sh                  # ✅ uv setup script (NEW v0.2.0)
├── .python-version              # Python version pin for uv
├── USAGE_GUIDE.md               # Detailed usage guide
├── M2_PRO_MAX_GUIDE.md          # Detailed M2 Pro Max guide
├── README.md                    # This file
└── pyproject.toml               # Package configuration (with uv support)
```
---
## 🧪 Testing
```bash
# With uv (recommended)
uv run pytest tests/
uv run pytest tests/test_adapters.py -v
uv run pytest tests/test_model.py -v
uv run pytest --cov=dflash_mlx tests/
# Classic pip
pytest tests/
pytest tests/test_adapters.py -v
pytest tests/test_model.py -v
```
---
## 🔧 Adding a New Model Family
To add support for a new architecture (e.g., Phi, Falcon):
```python
# 1. Subclass MLXTargetAdapter in dflash_mlx/adapters.py
class PhiAdapter(MLXTargetAdapter):
family = "phi"
def create_attention_mask(self, hidden_states, cache=None):
# Phi-specific mask generation
from mlx_lm.models import phi
return phi.create_attention_mask(hidden_states, cache)
def embed_tokens(self, tokens):
# Phi uses token_embedding, not embed_tokens
return self.model.token_embedding(tokens)
# 2. Register in ADAPTERS dict
ADAPTERS["phi"] = PhiAdapter
# 3. Add alias if needed
def adapter_for_model_type(model_type):
if model_type.startswith("phi"):
return PhiAdapter
# ...
```
See `ADDING_MODELS.md` (in Aryagm/dflash-mlx) for detailed pass/fail validation criteria.
---
## 📊 Performance (Reference)
Apple Silicon M2 Pro Max (96GB unified memory), MLX 0.25+:
| Model | Baseline tok/s | DFlash tok/s | **Speedup** | Memory |
|-------|---------------|-------------|-------------|--------|
| Qwen3-4B (4-bit) | ~45 | **~270** | **6.0×** | ~4.5GB |
| Qwen3-8B (4-bit) | ~22 | **~135** | **6.1×** | ~6.5GB |
| Qwen3.5-9B (4-bit) | ~18 | **~110** | **6.1×** | ~7.5GB |
| LLaMA-3.1-8B (4-bit) | ~20 | **~120** | **6.0×** | ~6.5GB |
| Qwen3.5-27B (4-bit) | ~5 | **~30** | **6.0×** | ~26GB |
> Actual numbers depend on prompt complexity, temperature, and hardware.
---
## 📝 Citation
```bibtex
@misc{chen2026dflash,
title={DFlash: Block Diffusion for Flash Speculative Decoding},
author={Jian Chen and Yesheng Liang and Zhijian Liu},
year={2026},
eprint={2602.06036},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
## 📄 License
MIT License, same as the original DFlash project.
---
## 🙏 Acknowledgements
- Original DFlash authors: Jian Chen, Yesheng Liang, Zhijian Liu
- **Aryagm** for the original MLX community port (`dflash-mlx`) and adapter pattern
- **bstnxbt** for the production MLX port with Metal kernels and prefix caching
- MLX team at Apple for the excellent MLX framework
- Hugging Face community for model hosting and tools
---
**Get 6× faster LLM inference on Apple Silicon today!** 🚀
> *Tested on M2/M3/M4 Pro/Max/Ultra with mlx-lm 0.24+.*