# DFlash-MLX-Universal: System Usage Guide
> How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4)
---
## πŸ“‹ Prerequisites
| Requirement | Version | Notes |
|------------|---------|-------|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |
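A quick way to verify these prerequisites before installing (a minimal sketch; it only inspects the OS, chip, and interpreter version):

```python
# Minimal prerequisite check: macOS on Apple Silicon, Python 3.9-3.12.
import platform
import sys

assert sys.platform == "darwin", "MLX requires macOS"
assert platform.machine() == "arm64", "MLX requires Apple Silicon (M1-M4)"
assert (3, 9) <= sys.version_info[:2] <= (3, 12), "Use Python 3.9-3.12"
print(f"OK: Python {platform.python_version()} on {platform.machine()}")
```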
---
## 1️⃣ Installation (Recommended: `uv`)
[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.
### Install `uv` (One-time)
```bash
# Option A: Homebrew (macOS)
brew install uv
# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh
# Verify
uv --version # Should show 0.6.x or higher
```
### Install DFlash-MLX-Universal with `uv`
```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
# 2. Create virtual environment with uv (uses .python-version file)
uv venv
# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"
# Or install directly from the repo (PEP 508 direct reference, so the extras apply)
uv pip install "dflash-mlx-universal[dev,server] @ git+https://huggingface.co/tritesh/dflash-mlx-universal.git"
```
### Alternative: `uv` project workflow (no manual venv)
```bash
# 1. Enter project directory
cd dflash-mlx-universal
# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"
# 3. Lock dependencies (creates uv.lock)
uv lock
# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py
# 5. Run tests
uv run pytest tests/ -v
# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```
### With `uv` and dependency groups
```bash
# Install only core dependencies
uv pip install -e .
# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"
# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"
# Install everything at once
uv pip install -e ".[dev,server]"
```
---
## 1️⃣-alt Installation (Classic `pip`)
If you prefer `pip`:
```bash
# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate # On zsh/bash
# 2. Upgrade pip
pip install --upgrade pip
# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"
# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
# Optional: server mode
pip install fastapi uvicorn
```
---
## 2️⃣ Quick Start β€” Using a Pre-converted Drafter
### Step A: Convert an Official DFlash Drafter to MLX
Official drafters are PyTorch models. You need to convert them to MLX format once:
```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx
```
**Supported drafters:**
```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx
# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx
# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx
# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx
# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```
**What this does:**
- Downloads PyTorch weights from HF Hub
- Transposes linear layers (PyTorch β†’ MLX format)
- Saves as `weights.npz` + `config.json`
- Creates `model_info.json` with target model mapping
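If you want to sanity-check the result, here is a small sketch (assuming the file names listed above) that inspects a converted drafter directory:

```python
# Inspect a converted drafter: count tensors and read the config.
import json
from pathlib import Path

import numpy as np

out = Path("~/models/dflash/Qwen3-4B-DFlash-mlx").expanduser()
weights = np.load(out / "weights.npz")
config = json.loads((out / "config.json").read_text())
print(f"{len(weights.files)} tensors, block_size={config.get('block_size')}")
```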
---
### Step B: Generate with DFlash Speculative Decoding
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")

# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)

# 4. Generate (reported ~6x speedup on supported models)
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)
```
Run with `uv`:
```bash
uv run python my_generate_script.py
```
**Expected output:**
```
[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
```
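The reported acceptance and speedup hang together the way you would expect from speculative decoding: each verification pass commits the accepted draft tokens plus one token from the target itself, discounted by drafting overhead. A back-of-envelope check (illustrative only, with an assumed overhead figure, not the library's internal accounting):

```python
# With ~a accepted draft tokens per target pass, each pass yields a+1
# tokens, so the ideal speedup is (a + 1) / (1 + drafting_overhead).
a = 6.23         # avg acceptance reported above
overhead = 0.25  # assumed relative cost of drafting per pass
print(f"~{(a + 1) / (1 + overhead):.1f}x")  # -> ~5.8x, matching the log
```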
---
## 3️⃣ Streaming Generation
For real-time output (chat UI, etc.):
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # ← Returns a generator
):
    print(chunk, end="", flush=True)
```
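If you also need the full text after streaming (to log the completion, for example), collect the chunks as they arrive:

```python
# Stream to the terminal while accumulating the complete output.
chunks = []
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,
):
    print(chunk, end="", flush=True)
    chunks.append(chunk)
full_text = "".join(chunks)
```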
---
## 4️⃣ Benchmark Mode
Compare DFlash vs baseline speed:
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)
print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```
Run:
```bash
uv run python benchmark_script.py
```
**Sample results (M2 Max, 96GB):**
```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```
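Those numbers are internally consistent: dividing the token count by the DFlash wall time reproduces the throughput, and the ratio of wall times reproduces the speedup (a back-of-envelope check using the sample line above):

```python
# Sanity-check the sample benchmark line.
tokens, t_dflash, t_baseline = 512, 0.41, 2.34
print(f"{tokens / t_dflash:.0f} tok/s, {t_baseline / t_dflash:.2f}x")
# -> ~1249 tok/s and 5.71x, in line with the reported figures
```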
---
## 5️⃣ Universal Decoder (Any Model Without Pre-built Drafter)
If your model doesn't have a DFlash drafter yet:
```python
from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder

# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up the proper adapter for hidden-state extraction
decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or local JSONL
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)

# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)
```
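Once training finishes, the custom drafter should be loadable like a pre-converted one. A sketch, assuming `train_drafter` writes the same on-disk format as `dflash_mlx.convert`:

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# Reload the trained drafter and pair it with its target model.
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
draft_model, cfg = load_mlx_dflash("~/models/dflash/my-llama-drafter")
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=cfg.get("block_size", 16),
)
```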
---
## 6️⃣ OpenAI-Compatible Server
Run a local server compatible with OpenAI clients:
```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --block-size 16 \
    --port 8000

# Or in background
nohup uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --port 8000 > dflash.log 2>&1 &
```
### Query the server
```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.0
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "max_tokens": 100,
    "stream": true
  }'

# Check metrics
curl http://localhost:8000/metrics
```
### Python client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
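Streaming works through the same client (standard OpenAI SDK usage; the server's `stream` flag is shown in the curl example above):

```python
# Stream tokens as they arrive from the local DFlash server.
stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Count to 10"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```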
---
## 7️⃣ Using with Ollama, aider, Continue, etc.
Any OpenAI-compatible client works:
### aider (AI coding assistant)
```bash
aider --model openai/qwen3-4b \
  --openai-api-base http://localhost:8000/v1 \
  --openai-api-key not-needed
```
### Continue.dev (VS Code extension)
```json
// .continue/config.json
{
  "models": [{
    "title": "DFlash Qwen3-4B",
    "provider": "openai",
    "model": "qwen3-4b",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}
```
### Ollama (as custom endpoint)
Point any OpenAI-compatible frontend (including UIs that normally target Ollama) at `http://localhost:8000/v1`.
---
## 8️⃣ Supported Model Families
| Family | Target Model Example | Drafter Status |
|--------|---------------------|---------------|
| **Qwen3** | `mlx-community/Qwen3-4B-bf16` | βœ… Pre-built |
| **Qwen3.5** | `mlx-community/Qwen3.5-9B-4bit` | βœ… Pre-built |
| **Qwen3.6** | `mlx-community/Qwen3.6-27B-4bit` | βœ… Pre-built |
| **LLaMA 3.1** | `mlx-community/Llama-3.1-8B-Instruct-4bit` | βœ… Pre-built |
| **LLaMA 3.3** | `mlx-community/Llama-3.3-70B-Instruct-4bit` | βœ… Pre-built |
| **Mistral** | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | ⚠️ Train custom |
| **Gemma** | `mlx-community/gemma-4-31b-it-4bit` | βœ… Pre-built |
| **Phi** | `mlx-community/Phi-3-mini-4k-instruct-4bit` | ⚠️ Generic adapter |
---
## 9️⃣ Troubleshooting
### "Unsupported model_type: phi"
```python
# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS

class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...

ADAPTERS["phi"] = PhiAdapter
```
### "Vocab size mismatch"
Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
### Slow first run
MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
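If you time things yourself, run a throwaway generation first (a one-line warmup using the same `decoder` as in the examples above):

```python
# Throwaway generation to trigger lazy Metal kernel compilation.
_ = decoder.generate(prompt="warmup", max_tokens=8)
```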
### Out of memory
- Reduce `--block-size` (default 16 β†’ 8)
- Use 4-bit quantized target models (`-4bit` suffix)
- Reduce `max_tokens`
### Draft tokens all rejected
- Drafter may not match target model (wrong family)
- Use trained drafter for your specific model
- Check `target_layer_ids` alignment in config
---
## πŸ”Ÿ Full Example Script
Save as `run_dflash.py`:
```python
#!/usr/bin/env python3
"""Complete DFlash example with error handling."""
import sys

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash


def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512

    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)

    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert "
              "--model z-lab/Qwen3-4B-DFlash-b16 "
              "--output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )

    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)
    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )
    print(output)
    print("-" * 60)
    print("Done!")


if __name__ == "__main__":
    main()
```
Run:
```bash
uv run python run_dflash.py
```
---
## πŸ”„ Daily Workflow with `uv`
```bash
# cd into your project
cd ~/projects/dflash-mlx-universal
# Run any script β€” uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py
# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000
# Run tests
uv run pytest tests/ -v
# Format code
uv run black dflash_mlx/
# Lint
uv run ruff check dflash_mlx/
# Add a dependency
uv add "numpy>=1.26.0"
# Lock dependencies
uv lock
# Sync environment with lock file
uv sync
```
---
## πŸ“š Next Steps
1. **Install `uv`** β†’ `brew install uv`
2. **Clone repo** β†’ `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** β†’ `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
4. **Convert drafter** β†’ `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** β†’ `uv run python examples/qwen3_4b_demo.py`
6. **Start server** β†’ `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** β†’ aider, Continue, custom clients
8. **Train custom drafters** β†’ For unsupported models using `UniversalDFlashDecoder`
---
For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions