# DFlash-MLX-Universal: System Usage Guide
> How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4)
---
## πŸ“‹ Prerequisites
| Requirement | Version | Notes |
|------------|---------|-------|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |
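A quick way to verify these prerequisites before installing (a minimal sketch; it only inspects the OS, chip, and interpreter version):

```python
# Minimal prerequisite check: macOS on Apple Silicon, Python 3.9-3.12.
import platform
import sys

assert sys.platform == "darwin", "MLX requires macOS"
assert platform.machine() == "arm64", "MLX requires Apple Silicon (M1-M4)"
assert (3, 9) <= sys.version_info[:2] <= (3, 12), "Use Python 3.9-3.12"
print(f"OK: Python {platform.python_version()} on {platform.machine()}")
```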
---
## 1️⃣ Installation (Recommended: `uv`)
[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.
### Install `uv` (One-time)
```bash
# Option A: Homebrew (macOS)
brew install uv
# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh
# Verify
uv --version # Should show 0.6.x or higher
```
### Install DFlash-MLX-Universal with `uv`
```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
# 2. Create virtual environment with uv (uses .python-version file)
uv venv
# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"
# Or install directly from the repo (PEP 508 direct reference, so the extras apply)
uv pip install "dflash-mlx-universal[dev,server] @ git+https://huggingface.co/tritesh/dflash-mlx-universal.git"
```
### Alternative: `uv` project workflow (no manual venv)
```bash
# 1. Enter project directory
cd dflash-mlx-universal
# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"
# 3. Lock dependencies (creates uv.lock)
uv lock
# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py
# 5. Run tests
uv run pytest tests/ -v
# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```
### With `uv` and dependency groups
```bash
# Install only core dependencies
uv pip install -e .
# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"
# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"
# Install everything at once
uv pip install -e ".[dev,server]"
```
---
## 1️⃣-alt Installation (Classic `pip`)
If you prefer `pip`:
```bash
# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate # On zsh/bash
# 2. Upgrade pip
pip install --upgrade pip
# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"
# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
# Optional: server mode
pip install fastapi uvicorn
```
---
## 2️⃣ Quick Start β€” Using a Pre-converted Drafter
### Step A: Convert an Official DFlash Drafter to MLX
Official drafters are PyTorch models. You need to convert them to MLX format once:
```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx
```
**Supported drafters:**
```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx
# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx
# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx
# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx
# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```
**What this does:**
- Downloads PyTorch weights from HF Hub
- Transposes linear layers (PyTorch β†’ MLX format)
- Saves as `weights.npz` + `config.json`
- Creates `model_info.json` with target model mapping
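If you want to sanity-check the result, here is a small sketch (assuming the file names listed above) that inspects a converted drafter directory:

```python
# Inspect a converted drafter: count tensors and read the config.
import json
from pathlib import Path

import numpy as np

out = Path("~/models/dflash/Qwen3-4B-DFlash-mlx").expanduser()
weights = np.load(out / "weights.npz")
config = json.loads((out / "config.json").read_text())
print(f"{len(weights.files)} tensors, block_size={config.get('block_size')}")
```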
---
### Step B: Generate with DFlash Speculative Decoding
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")

# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)

# 4. Generate (reported ~6x speedup on supported models)
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)
```
Run with `uv`:
```bash
uv run python my_generate_script.py
```
**Expected output:**
```
[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
```
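The reported acceptance and speedup hang together the way you would expect from speculative decoding: each verification pass commits the accepted draft tokens plus one token from the target itself, discounted by drafting overhead. A back-of-envelope check (illustrative only, with an assumed overhead figure, not the library's internal accounting):

```python
# With ~a accepted draft tokens per target pass, each pass yields a+1
# tokens, so the ideal speedup is (a + 1) / (1 + drafting_overhead).
a = 6.23         # avg acceptance reported above
overhead = 0.25  # assumed relative cost of drafting per pass
print(f"~{(a + 1) / (1 + overhead):.1f}x")  # -> ~5.8x, matching the log
```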
---
## 3️⃣ Streaming Generation
For real-time output (chat UI, etc.):
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # ← Returns a generator
):
    print(chunk, end="", flush=True)
```
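If you also need the full text after streaming (to log the completion, for example), collect the chunks as they arrive:

```python
# Stream to the terminal while accumulating the complete output.
chunks = []
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,
):
    print(chunk, end="", flush=True)
    chunks.append(chunk)
full_text = "".join(chunks)
```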
---
## 4️⃣ Benchmark Mode
Compare DFlash vs baseline speed:
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)
print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```
Run:
```bash
uv run python benchmark_script.py
```
**Sample results (M2 Max, 96GB):**
```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```
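Those numbers are internally consistent: dividing the token count by the DFlash wall time reproduces the throughput, and the ratio of wall times reproduces the speedup (a back-of-envelope check using the sample line above):

```python
# Sanity-check the sample benchmark line.
tokens, t_dflash, t_baseline = 512, 0.41, 2.34
print(f"{tokens / t_dflash:.0f} tok/s, {t_baseline / t_dflash:.2f}x")
# -> ~1249 tok/s and 5.71x, in line with the reported figures
```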
---
## 5️⃣ Universal Decoder (Any Model Without Pre-built Drafter)
If your model doesn't have a DFlash drafter yet:
```python
from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder

# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up the proper adapter for hidden-state extraction
decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or local JSONL
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)

# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)
```
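Once training finishes, the custom drafter should be loadable like a pre-converted one. A sketch, assuming `train_drafter` writes the same on-disk format as `dflash_mlx.convert`:

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# Reload the trained drafter and pair it with its target model.
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
draft_model, cfg = load_mlx_dflash("~/models/dflash/my-llama-drafter")
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=cfg.get("block_size", 16),
)
```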
---
## 6️⃣ OpenAI-Compatible Server
Run a local server compatible with OpenAI clients:
```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --block-size 16 \
    --port 8000

# Or in background
nohup uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --port 8000 > dflash.log 2>&1 &
```
### Query the server
```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.0
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "max_tokens": 100,
    "stream": true
  }'

# Check metrics
curl http://localhost:8000/metrics
```
### Python client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
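Streaming works through the same client (standard OpenAI SDK usage; the server's `stream` flag is shown in the curl example above):

```python
# Stream tokens as they arrive from the local DFlash server.
stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Count to 10"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```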
---
## 7️⃣ Using with Ollama, aider, Continue, etc.
Any OpenAI-compatible client works:
### aider (AI coding assistant)
```bash
aider --model openai/qwen3-4b \
  --openai-api-base http://localhost:8000/v1 \
  --openai-api-key not-needed
```
### Continue.dev (VS Code extension)
```json
// .continue/config.json
{
  "models": [{
    "title": "DFlash Qwen3-4B",
    "provider": "openai",
    "model": "qwen3-4b",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}
```
### Ollama (as custom endpoint)
Point any OpenAI-compatible frontend (including UIs that normally target Ollama) at `http://localhost:8000/v1`.
---
## 8️⃣ Supported Model Families
| Family | Target Model Example | Drafter Status |
|--------|---------------------|---------------|
| **Qwen3** | `mlx-community/Qwen3-4B-bf16` | βœ… Pre-built |
| **Qwen3.5** | `mlx-community/Qwen3.5-9B-4bit` | βœ… Pre-built |
| **Qwen3.6** | `mlx-community/Qwen3.6-27B-4bit` | βœ… Pre-built |
| **LLaMA 3.1** | `mlx-community/Llama-3.1-8B-Instruct-4bit` | βœ… Pre-built |
| **LLaMA 3.3** | `mlx-community/Llama-3.3-70B-Instruct-4bit` | βœ… Pre-built |
| **Mistral** | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | ⚠️ Train custom |
| **Gemma** | `mlx-community/gemma-4-31b-it-4bit` | βœ… Pre-built |
| **Phi** | `mlx-community/Phi-3-mini-4k-instruct-4bit` | ⚠️ Generic adapter |
---
## 9️⃣ Troubleshooting
### "Unsupported model_type: phi"
```python
# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS

class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...

ADAPTERS["phi"] = PhiAdapter
```
### "Vocab size mismatch"
Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
### Slow first run
MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
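If you time things yourself, run a throwaway generation first (a one-line warmup using the same `decoder` as in the examples above):

```python
# Throwaway generation to trigger lazy Metal kernel compilation.
_ = decoder.generate(prompt="warmup", max_tokens=8)
```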
### Out of memory
- Reduce `--block-size` (default 16 β†’ 8)
- Use 4-bit quantized target models (`-4bit` suffix)
- Reduce `max_tokens`
### Draft tokens all rejected
- Drafter may not match target model (wrong family)
- Use trained drafter for your specific model
- Check `target_layer_ids` alignment in config
---
## πŸ”Ÿ Full Example Script
Save as `run_dflash.py`:
```python
#!/usr/bin/env python3
"""Complete DFlash example with error handling."""
import sys

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash


def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512

    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)

    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert "
              "--model z-lab/Qwen3-4B-DFlash-b16 "
              "--output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )

    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)
    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )
    print(output)
    print("-" * 60)
    print("Done!")


if __name__ == "__main__":
    main()
```
Run:
```bash
uv run python run_dflash.py
```
---
## πŸ”„ Daily Workflow with `uv`
```bash
# cd into your project
cd ~/projects/dflash-mlx-universal
# Run any script β€” uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py
# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000
# Run tests
uv run pytest tests/ -v
# Format code
uv run black dflash_mlx/
# Lint
uv run ruff check dflash_mlx/
# Add a dependency
uv add "numpy>=1.26.0"
# Lock dependencies
uv lock
# Sync environment with lock file
uv sync
```
---
## πŸ“š Next Steps
1. **Install `uv`** β†’ `brew install uv`
2. **Clone repo** β†’ `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** β†’ `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
4. **Convert drafter** β†’ `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** β†’ `uv run python examples/qwen3_4b_demo.py`
6. **Start server** β†’ `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** β†’ aider, Continue, custom clients
8. **Train custom drafters** β†’ For unsupported models using `UniversalDFlashDecoder`
---
For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions