---

## 1️⃣ Installation (Recommended: `uv`)

[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.

### Install `uv` (One-time)

```bash
# Option A: Homebrew (macOS)
brew install uv

# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify
uv --version  # Should show 0.6.x or higher
```

### Install DFlash-MLX-Universal with `uv`

```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal

# 2. Create a virtual environment with uv (uses the .python-version file)
uv venv

# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"

# Or install directly from the repo
uv pip install "git+https://huggingface.co/tritesh/dflash-mlx-universal.git[dev,server]"
```

### Alternative: `uv` project workflow (no manual venv)

```bash
# 1. Enter the project directory
cd dflash-mlx-universal

# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"

# 3. Lock dependencies (creates uv.lock)
uv lock

# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py

# 5. Run the tests
uv run pytest tests/ -v

# 6. Start the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```

### With `uv` and dependency groups

```bash
# Install only the core dependencies
uv pip install -e .

# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"

# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"

# Install everything at once
uv pip install -e ".[dev,server]"
```

---

## 1️⃣-alt Installation (Classic `pip`)

If you prefer `pip`:

```bash
# 1. Create a virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate  # On zsh/bash

# 2. Install from the repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git

# 3. Server extras (FastAPI + uvicorn)
pip install fastapi uvicorn
```

---

## 2️⃣ Quick Start – Using a Pre-converted Drafter

Official drafters are PyTorch models. You need to convert them to MLX format once:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
  --model z-lab/Qwen3-4B-DFlash-b16 \
  --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
  --model z-lab/Qwen3-4B-DFlash-b16 \
  --output ~/models/dflash/Qwen3-4B-DFlash-mlx
```

**Supported drafters:**

```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx

# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx

# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx

# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx

# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx

# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```

**What this does:**

```python
print(output)
```

Run with `uv`:

```bash
uv run python my_generate_script.py
```

**Expected output:**

```
[DFlash] Prefill: processing 12 prompt tokens...
```
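The block-wise numbers in logs like this come from draft-and-verify decoding: the drafter proposes a whole block of tokens, and the target model checks it in a single pass. A toy sketch of greedy verification (illustrative only, not the library's actual code):

```python
def verify_block(draft, target):
    """Toy greedy speculative verification: keep draft tokens while they
    match the target model's own greedy picks; at the first mismatch emit
    the target's token instead, so each verify pass yields >= 1 token."""
    out = []
    for d, t in zip(draft, target):
        if d == t:
            out.append(d)
        else:
            out.append(t)  # the target's correction ends the block
            break
    return out

# A 4-token drafted block vs. the target's greedy choices
print(verify_block([5, 9, 2, 7], [5, 9, 4, 7]))  # → [5, 9, 4]
```

Accepting long prefixes is what makes a good drafter pay off: the more of each block survives verification, the fewer target-model passes per generated token.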

```python
print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```

Run:

```bash
uv run python benchmark_script.py
```

**Sample results (M2 Max, 96 GB):**

```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```
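The speedup figure in the benchmark line is simply the ratio of the two wall-clock times:

```python
# Timings taken from the sample benchmark line above
baseline_s = 2.34  # plain autoregressive decoding
dflash_s = 0.41    # DFlash speculative decoding

print(f"Speedup: {baseline_s / dflash_s:.2f}x")  # → Speedup: 5.71x
```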

```python
decoder = UniversalDFlashDecoder(
    ...
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or a local JSONL file
    epochs=6,
    ...
)
```
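`train_drafter` above also accepts a local JSONL file in place of a dataset name. A minimal sketch of writing one; note the one-JSON-object-per-line layout. The `text` field name is an assumption here, so check the repo's training docs for the expected schema:

```python
import json

samples = [
    {"text": "Prove that the sum of two even integers is even."},
    {"text": "Solve 3x + 5 = 20 for x."},
]

# JSONL = one JSON object per line
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

print(sum(1 for _ in open("my_dataset.jsonl")))  # → 2
```

You would then pass `dataset="my_dataset.jsonl"` instead of `"open-web-math"`.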

Run a local server compatible with OpenAI clients:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --block-size 16 \
  --port 8000

# Or in the background
nohup uv run python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --port 8000 > dflash.log 2>&1 &
```
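Once the server is up, any HTTP client can talk to the OpenAI-style endpoint. A stdlib-only sketch that builds (without sending) a chat-completion request; the model name `qwen3-4b` and the dummy API key mirror the client examples in this guide:

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, messages: list) -> request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": 128}
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # assumed: local server ignores the key
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",
    "qwen3-4b",
    [{"role": "user", "content": "Say hello in five words."}],
)
print(req.full_url)  # → http://localhost:8000/v1/chat/completions
# To actually send it: print(request.urlopen(req).read().decode())
```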

Any OpenAI-compatible client works:

### aider (AI coding assistant)

```bash
aider --model openai/qwen3-4b \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key not-needed
```

### Continue.dev (VS Code extension)
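For Continue, point a model entry at the local server. A sketch assuming Continue's classic `config.json` format (field names vary across Continue versions, so treat it as a starting point):

```json
{
  "models": [
    {
      "title": "Qwen3-4B (DFlash local)",
      "provider": "openai",
      "model": "qwen3-4b",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}
```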

```python
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
```

Run:

```bash
uv run python run_dflash.py
```

---

## Daily Workflow with `uv`

```bash
# cd into your project
cd ~/projects/dflash-mlx-universal

# Run any script - uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py

# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000

# Run tests
uv run pytest tests/ -v

# Format code
uv run black dflash_mlx/

# Lint
uv run ruff check dflash_mlx/

# Add a dependency (quote it so the shell doesn't treat >= as a redirect)
uv add "numpy>=1.26.0"

# Lock dependencies
uv lock

# Sync environment with lock file
uv sync
```

---

## Next Steps

1. **Install `uv`** – `brew install uv`
2. **Clone repo** – `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** – `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
4. **Convert drafter** – `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** – `uv run python examples/qwen3_4b_demo.py`
6. **Start server** – `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** – aider, Continue, custom clients
8. **Train custom drafters** – for unsupported models, using `UniversalDFlashDecoder`

---