DFlash-MLX-Universal: Block Diffusion Speculative Decoding for MLX
Universal DFlash speculative decoding implementation for Apple Silicon (MLX). Works with any MLX-converted model: Qwen3, Qwen3.5, LLaMA, Mistral, Gemma, and more.
What is DFlash?
DFlash (Chen et al., 2026) accelerates autoregressive LLM inference by using a lightweight block diffusion model as a speculative drafter. Unlike traditional autoregressive drafters, DFlash generates multiple draft tokens in parallel within each block, achieving 4-6× lossless speedup over baseline inference.
Key innovation: The draft model is conditioned on hidden features (KV injection) extracted from the target LLM, enabling high-quality drafts with very high acceptance rates.
| Feature | Baseline | DFlash | Improvement |
|---|---|---|---|
| Speed | ~20 tok/s | ~120 tok/s | 6× faster |
| Quality | Same | Same | Lossless |
| Acceptance | N/A | τ ≈ 6-7 | ~6 tokens accepted per draft |
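Conceptually, one decoding step proceeds as in the sketch below. This is an illustration only: `target` and `drafter` are hypothetical objects, not this package's API (see Quick Start for the real entry points).

```python
# Conceptual sketch of one DFlash decoding step (hypothetical objects,
# not this package's API).
def dflash_step(target, drafter, context_tokens, block_size=16):
    # 1. Run the target over the current context and keep its hidden
    #    features; these condition the drafter ("KV injection").
    hidden = target.hidden_features(context_tokens)

    # 2. The block-diffusion drafter proposes a whole block of tokens
    #    in parallel rather than one at a time.
    draft_block = drafter.propose(hidden, block_size)  # list of ints

    # 3. One target forward pass scores every draft position.
    #    target_preds[i] is the target's own greedy prediction for draft
    #    position i; it has block_size + 1 entries (one past the last draft).
    target_preds = target.verify(context_tokens, draft_block)

    # 4. Accept the longest matching prefix, then append one "bonus" token
    #    from the target so at least one token is produced per step.
    accepted = []
    for drafted, predicted in zip(draft_block, target_preds):
        if drafted != predicted:
            break
        accepted.append(drafted)
    accepted.append(target_preds[len(accepted)])
    return accepted
```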
What's New in Universal (v0.2.0)
This is a major rewrite that fixes the critical gaps in earlier community ports:
| Gap | Before (v0.1.x) | Now (v0.2.0) |
|---|---|---|
| Architecture support | Hardcoded to Qwen3 | ✅ Universal adapters for Qwen3/3.5, LLaMA, Mistral, Gemma |
| Hidden state extraction | Direct `.layers` access (breaks on most models) | ✅ Architecture-aware adapter system with per-family hooks |
| KV cache management | None (never rewound) | ✅ Proper trim/rewind on draft rejection |
| Attention masks | `mask=None` (undefined behavior) | ✅ Family-specific mask generation |
| Token acceptance | Buggy cumprod logic | ✅ First-mismatch detection with bonus token |
| Streaming | Not supported | ✅ Real-time text streaming with generator interface |
| OpenAI server | Not supported | ✅ FastAPI + simple HTTP with metrics endpoint |
| Model conversion | PyTorch → MLX weight converter | ✅ Updated for all z-lab drafters |
| Training | Basic trainer | ✅ Architecture-aware training with adapter compatibility |
| Benchmarking | None | ✅ Built-in benchmark vs `mlx_lm` baseline |
| uv support | pip only | ✅ `uv` + `uv run` workflow with lock files |
Installation
Option 1: uv (recommended: ultra-fast, reproducible)
uv is an extremely fast Python package manager written in Rust. It's the recommended way to install on macOS.
# 1. Install uv (one-time)
brew install uv
# or: curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and setup
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
# 3. One-command setup (creates venv, installs deps, locks)
chmod +x setup_uv.sh
./setup_uv.sh
# Or manually:
uv venv
uv pip install -e ".[dev,server]"
uv lock
Why uv?
- 10-100× faster than `pip` (written in Rust)
- Automatic virtual environment management
- Lock file (`uv.lock`) for reproducible installs
- `uv run` → run any script without activating the venv manually
# Examples of uv workflow
uv run python examples/qwen3_4b_demo.py
uv run pytest tests/ -v
uv run python -m dflash_mlx.serve --target ... --draft ... --port 8000
uv run black dflash_mlx/
uv run ruff check dflash_mlx/
Option 2: pip (Classic)
pip install mlx-lm dflash-mlx-universal
For Apple Silicon (M1/M2/M3/M4):
pip install --upgrade pip
pip install mlx-lm dflash-mlx-universal
Optional (for server mode):
pip install fastapi uvicorn
Quick Start
Option 1: Pre-converted DFlash drafter (recommended)
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash, infer_target_model
from mlx_lm import load
# 1. Load any MLX target model
target_path = "mlx-community/Qwen3-4B-bf16"
model, tokenizer = load(target_path)
# 2. Load a pre-converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("./Qwen3-4B-DFlash-mlx")
# 3. Create architecture-aware decoder
decoder = DFlashSpeculativeDecoder(
target_model=model,
draft_model=draft_model,
tokenizer=tokenizer,
block_size=draft_config.get("block_size", 16),
)
# 4. Generate with 6× speedup
output = decoder.generate(
prompt="Explain quantum computing to a 10-year-old.",
max_tokens=1024,
temperature=0.0,
)
print(output)
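Streaming is also supported via a generator interface (see "What's New" above). The method name below is a guess, so check the package's docstrings for the exact signature:

```python
# Hedged sketch: assumes a generator-style streaming method on the decoder
# (method name hypothetical; consult the package for the exact API).
for chunk in decoder.stream_generate(
    prompt="Summarize the plot of Hamlet in three sentences.",
    max_tokens=256,
    temperature=0.0,
):
    print(chunk, end="", flush=True)
print()
```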
Option 2: Universal decoder (auto-detects architecture)
from dflash_mlx.universal import UniversalDFlashDecoder
from mlx_lm import load
# Works with ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
# Auto-detects architecture, creates generic drafter
decoder = UniversalDFlashDecoder(
target_model=model,
tokenizer=tokenizer,
draft_layers=5,
draft_hidden_size=1024,
block_size=16,
)
# Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
dataset="open-web-math",
epochs=6,
lr=6e-4,
batch_size=16,
)
output = decoder.generate("Write a Python function to implement quicksort.")
print(output)
Option 3: Convert PyTorch drafter to MLX
# Download official z-lab drafter and convert weights
python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ./Qwen3-4B-DFlash-mlx
# Or with uv (recommended)
uv run python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ./Qwen3-4B-DFlash-mlx
# Or in Python
from dflash_mlx.convert import convert_dflash_to_mlx
convert_dflash_to_mlx(
pytorch_model_id="z-lab/Qwen3.5-9B-DFlash",
output_path="./Qwen3.5-9B-DFlash-mlx",
)
Supported Models
Pre-built DFlash drafters (convert to MLX)
All official z-lab/*-DFlash models can be converted:
| PyTorch Drafter | Target Model | Status |
|---|---|---|
| `z-lab/Qwen3-4B-DFlash-b16` | `Qwen/Qwen3-4B` | ✅ Ready |
| `z-lab/Qwen3-8B-DFlash-b16` | `Qwen/Qwen3-8B` | ✅ Ready |
| `z-lab/Qwen3.5-4B-DFlash` | `Qwen/Qwen3.5-4B` | ✅ Ready |
| `z-lab/Qwen3.5-9B-DFlash` | `Qwen/Qwen3.5-9B` | ✅ Ready |
| `z-lab/Qwen3.5-27B-DFlash` | `Qwen/Qwen3.5-27B` | ✅ Ready |
| `z-lab/Qwen3.6-27B-DFlash` | `Qwen/Qwen3.6-27B` | ✅ Ready |
| `z-lab/Qwen3.6-35B-A3B-DFlash` | `Qwen/Qwen3.6-35B-A3B` | ✅ Ready |
| `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat` | `meta-llama/Llama-3.1-8B` | ✅ Ready |
| `z-lab/gemma-4-31B-it-DFlash` | `google/gemma-4-31b-it` | ✅ Ready |
| `z-lab/gpt-oss-20b-DFlash` | `openai/gpt-oss-20b` | ✅ Ready |
| `z-lab/Kimi-K2.5-DFlash` | `moonshotai/Kimi-K2.5` | ✅ Ready |
Architecture adapters (built-in)
| Model Family | Adapter | Hidden States | KV Cache | Attention Mask |
|---|---|---|---|---|
| Qwen3 | `Qwen3Adapter` | ✅ | ✅ `KVCache.trim()` | ✅ `qwen3.create_attention_mask` |
| Qwen3.5 | `Qwen35Adapter` | ✅ | ✅ `ArraysCache` | ✅ Hybrid FA + SSM masks |
| LLaMA 2/3 | `LlamaAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `llama.create_attention_mask` |
| Mistral | `MistralAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `mistral.create_attention_mask` |
| Gemma | `GemmaAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `gemma.create_attention_mask` |
| Generic | `MLXTargetAdapter` | ✅ | ✅ Basic trim | ⚠️ Causal fallback (sketch below) |
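For the generic adapter, the causal fallback is just a standard additive causal mask. A minimal sketch using MLX's built-in helper (illustrative only, not the adapter's actual code):

```python
import mlx.core as mx
import mlx.nn as nn

def causal_fallback_mask(hidden_states: mx.array) -> mx.array:
    # Additive causal mask: 0 on and below the diagonal, -inf above it,
    # so each position attends only to itself and earlier positions.
    seq_len = hidden_states.shape[1]
    mask = nn.MultiHeadAttention.create_additive_causal_mask(seq_len)
    return mask.astype(hidden_states.dtype)
```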
Architecture Overview
┌──────────────────┐      ┌──────────────────┐
│   Target Model   │─────▶│  Extract Hidden  │
│  (Any MLX LLM)   │      │   Features (KV)  │
└──────────────────┘      └────────┬─────────┘
                                   │
                                   ▼
┌──────────────────┐      ┌──────────────────┐
│  Verify Drafts   │◀─────│   DFlash Draft   │
│    (Parallel)    │      │ Model (Diffusion)│
└──────────────────┘      └──────────────────┘
        │                          ▲
        │ Accepted Tokens          │
        └──────────────────────────┘
Key Design
- Architecture Adapters: per-family `MLXTargetAdapter` subclasses handle embedding extraction, layer iteration, attention masks, and KV cache management
- KV Injection: target model hidden states → draft model's K/V projections via `extract_context_features()`
- Block Diffusion: all tokens in a block predicted in parallel (not sequentially)
- Cross-Layer Fusion: features from multiple target layers concatenated and projected
- Exact Acceptance: draft tokens verified greedily; KV cache rewound to accepted prefix (see the sketch below)
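The last point in practice: during verification the target's cache advances over every drafted position, so the rejected tail has to be trimmed away before the next step. A minimal sketch using mlx_lm's prompt-cache utilities (the surrounding logic is illustrative, not this package's exact code):

```python
from mlx_lm.models.cache import trim_prompt_cache

def rewind_to_accepted(cache, block_size, num_accepted):
    # The cache was advanced by block_size positions while verifying the
    # draft block, but only the first num_accepted tokens were confirmed.
    num_rejected = block_size - num_accepted
    if num_rejected > 0:
        # Drop the cache entries for the rejected tail of the block.
        trim_prompt_cache(cache, num_rejected)
    return cache
```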
Benchmarking
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
from mlx_lm import load
model, tokenizer = load("Qwen/Qwen3-4B")
draft_model, _ = load_mlx_dflash("./Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
# Built-in benchmark (runs warmup + multiple trials)
results = decoder.benchmark(
prompt="Write a quicksort in Python.",
max_tokens=512,
num_runs=5,
)
# prints: Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
OpenAI-Compatible Server
# Start server with DFlash acceleration
python -m dflash_mlx.serve \
--target mlx-community/Qwen3.5-9B-4bit \
--draft ./Qwen3.5-9B-DFlash-mlx \
--block-size 16 \
--port 8000
# With uv (recommended)
uv run python -m dflash_mlx.serve \
--target mlx-community/Qwen3.5-9B-4bit \
--draft ./Qwen3.5-9B-DFlash-mlx \
--block-size 16 \
--port 8000
# Query with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-9b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256,
"temperature": 0.0,
"stream": false
}'
# Streaming SSE
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-9b",
"messages": [{"role": "user", "content": "Count to 10"}],
"max_tokens": 100,
"stream": true
}'
# Check metrics
curl http://localhost:8000/metrics
Endpoints:
- `GET /health` → Server status and mode
- `GET /v1/models` → Available models
- `GET /metrics` → Request count, tok/s, recent history
- `POST /v1/chat/completions` → Chat completions (OpenAI-compatible)
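Because the endpoints follow the OpenAI schema, any OpenAI-compatible client can talk to the server. For example, with the official `openai` Python package pointed at the local address (the model name should match whatever `GET /v1/models` reports):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local DFlash server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5-9b",  # use a name listed by GET /v1/models
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    temperature=0.0,
)
print(response.choices[0].message.content)
```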
Training Custom Drafters
from dflash_mlx.universal import UniversalDFlashDecoder
from mlx_lm import load
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
decoder = UniversalDFlashDecoder(
target_model=model,
tokenizer=tokenizer,
draft_layers=5,
draft_hidden_size=1024,
)
# Train using paper recipe (6 epochs, lr=6e-4, AdamW)
decoder.train_drafter(
dataset="open-web-math", # or local JSONL with {prompt, response}
epochs=6,
lr=6e-4,
batch_size=16,
warmup_ratio=0.04,
grad_clip=1.0,
output_path="./my-llama-drafter",
)
# Save and reload
decoder.save_drafter("./my-llama-drafter")
Training recipe (from DFlash paper §5):
- Data mix: 50% Chat + 30% Math + 20% Code
- Random anchor sampling: real accepted tokens as block starts
- Sparse attention mask: bidirectional within block, causal across blocks
- Position-dependent loss decay: exponential decay from anchor
- AdamW: lr=6e-4, 6 epochs, grad_clip=1.0, cosine schedule
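A rough sketch of two of the ingredients above: the block-sparse attention mask (bidirectional inside a block, causal across blocks) and an exponential position-dependent loss decay. The decay rate here is illustrative, not the paper's exact schedule:

```python
import mlx.core as mx

def block_sparse_mask(seq_len: int, block_size: int) -> mx.array:
    # Position i may attend to position j iff j's block does not come after
    # i's block: bidirectional within a block, causal across blocks.
    blocks = mx.arange(seq_len) // block_size
    allowed = blocks[None, :] <= blocks[:, None]
    return mx.where(allowed, mx.array(0.0), mx.array(float("-inf")))

def loss_decay_weights(block_size: int, decay: float = 0.8) -> mx.array:
    # Down-weight positions far from the block anchor (decay is illustrative).
    return mx.power(decay, mx.arange(block_size, dtype=mx.float32))
```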
Repository Structure
dflash-mlx-universal/
├── dflash_mlx/
│   ├── __init__.py              # Package exports
│   ├── adapters.py              # Architecture adapters (NEW v0.2.0)
│   ├── model.py                 # DFlash draft model (attention, diffusion)
│   ├── speculative_decode.py    # Core speculative decoding loop (FIXED)
│   ├── convert.py               # PyTorch → MLX weight converter
│   ├── universal.py             # Generic decoder for any model
│   ├── trainer.py               # DFlash drafter training
│   ├── data.py                  # Training data generation
│   └── serve.py                 # OpenAI-compatible HTTP server (NEW)
├── examples/
│   ├── qwen3_4b_demo.py         # End-to-end Qwen3 demo
│   ├── convert_drafter.py       # CLI conversion script
│   └── train_custom_drafter.py  # CLI training script
├── tests/
│   ├── test_model.py            # Model unit tests
│   └── test_adapters.py         # Adapter tests (NEW)
├── benchmark_m2.py              # Apple Silicon benchmark
├── setup_m2.sh                  # Automated setup script
├── setup_uv.sh                  # uv setup script (NEW v0.2.0)
├── .python-version              # Python version pin for uv
├── USAGE_GUIDE.md               # Detailed usage guide
├── M2_PRO_MAX_GUIDE.md          # Detailed M2 Pro Max guide
├── README.md                    # This file
└── pyproject.toml               # Package configuration (with uv support)
Testing
# With uv (recommended)
uv run pytest tests/
uv run pytest tests/test_adapters.py -v
uv run pytest tests/test_model.py -v
uv run pytest --cov=dflash_mlx tests/
# Classic pip
pytest tests/
pytest tests/test_adapters.py -v
pytest tests/test_model.py -v
Adding a New Model Family
To add support for a new architecture (e.g., Phi, Falcon):
# 1. Subclass MLXTargetAdapter in dflash_mlx/adapters.py
class PhiAdapter(MLXTargetAdapter):
    family = "phi"

    def create_attention_mask(self, hidden_states, cache=None):
        # Phi-specific mask generation
        from mlx_lm.models import phi
        return phi.create_attention_mask(hidden_states, cache)

    def embed_tokens(self, tokens):
        # Phi uses token_embedding, not embed_tokens
        return self.model.token_embedding(tokens)

# 2. Register in ADAPTERS dict
ADAPTERS["phi"] = PhiAdapter

# 3. Add alias if needed
def adapter_for_model_type(model_type):
    if model_type.startswith("phi"):
        return PhiAdapter
    # ...
See ADDING_MODELS.md (in Aryagm/dflash-mlx) for detailed pass/fail validation criteria.
Performance (Reference)
Apple Silicon M2 Pro Max (96GB unified memory), MLX 0.25+:
| Model | Baseline tok/s | DFlash tok/s | Speedup | Memory |
|---|---|---|---|---|
| Qwen3-4B (4-bit) | ~45 | ~270 | 6.0× | ~4.5GB |
| Qwen3-8B (4-bit) | ~22 | ~135 | 6.1× | ~6.5GB |
| Qwen3.5-9B (4-bit) | ~18 | ~110 | 6.1× | ~7.5GB |
| LLaMA-3.1-8B (4-bit) | ~20 | ~120 | 6.0× | ~6.5GB |
| Qwen3.5-27B (4-bit) | ~5 | ~30 | 6.0× | ~26GB |
Actual numbers depend on prompt complexity, temperature, and hardware.
Citation
@misc{chen2026dflash,
title={DFlash: Block Diffusion for Flash Speculative Decoding},
author={Jian Chen and Yesheng Liang and Zhijian Liu},
year={2026},
eprint={2602.06036},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License
MIT License, the same as the original DFlash project.
Acknowledgements
- Original DFlash authors: Jian Chen, Yesheng Liang, Zhijian Liu
- Aryagm for the original MLX community port (`dflash-mlx`) and the adapter pattern
- bstnxbt for the production MLX port with Metal kernels and prefix caching
- MLX team at Apple for the excellent MLX framework
- Hugging Face community for model hosting and tools
Get 6× faster LLM inference on Apple Silicon today!
Tested on M2/M3/M4 Pro/Max/Ultra with mlx-lm 0.24+.

# Install MLX LM
uv tool install mlx-lm

# Generate some text
mlx_lm.generate --model "tritesh/dflash-mlx-universal" --prompt "Once upon a time"