tritesh committed
Commit 937c2a6 · verified · 1 Parent(s): 4ce0ad0

Upload USAGE_GUIDE.md

Files changed (1)
  1. USAGE_GUIDE.md +166 -31
USAGE_GUIDE.md CHANGED
@@ -15,10 +15,86 @@
 
  ---
 
- ## 1️⃣ Installation
 
  ```bash
- # 1. Create a virtual environment (recommended)
  python3 -m venv .venv-dflash
  source .venv-dflash/bin/activate # On zsh/bash
 
@@ -35,14 +111,6 @@ pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
  pip install fastapi uvicorn
  ```
 
- ### Alternative: Install from local clone
-
- ```bash
- git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
- cd dflash-mlx-universal
- pip install -e .
- ```
-
  ---
 
  ## 2️⃣ Quick Start — Using a Pre-converted Drafter
@@ -52,20 +120,39 @@ pip install -e .
  Official drafters are PyTorch models. You need to convert them to MLX format once:
 
  ```bash
- # Convert Qwen3-4B drafter (~2-4 minutes on M2 Pro Max)
- python -m dflash_mlx.convert \
  --model z-lab/Qwen3-4B-DFlash-b16 \
  --output ~/models/dflash/Qwen3-4B-DFlash-mlx
 
- # Convert Qwen3.5-9B drafter
  python -m dflash_mlx.convert \
- --model z-lab/Qwen3.5-9B-DFlash \
- --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
 
- # Convert LLaMA-3.1-8B drafter
- python -m dflash_mlx.convert \
- --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
- --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
  ```
 
  **What this does:**
@@ -106,6 +193,11 @@ output = decoder.generate(
  print(output)
  ```
 
  **Expected output:**
  ```
  [DFlash] Prefill: processing 12 prompt tokens...
@@ -163,6 +255,11 @@ print(f"Speedup: {results['speedup']:.2f}x")
  print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
  ```
 
  **Sample results (M2 Pro Max 96GB):**
  ```
  [Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
@@ -194,7 +291,7 @@ decoder = UniversalDFlashDecoder(
  block_size=16,
  )
 
- # Option A: Train a custom drafter (2-8 hours)
  decoder.train_drafter(
  dataset="open-web-math", # or local JSONL
  epochs=6,
@@ -217,15 +314,15 @@ output = decoder.generate(
  Run a local server compatible with OpenAI clients:
 
  ```bash
- # Start server
- python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --block-size 16 \
  --port 8000
 
  # Or in background
- nohup python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --port 8000 > dflash.log 2>&1 &
@@ -284,7 +381,9 @@ Any OpenAI-compatible client works:
 
  ### aider (AI coding assistant)
  ```bash
- aider --model openai/qwen3-4b --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed
  ```
 
  ### Continue.dev (VS Code extension)
@@ -381,7 +480,7 @@ def main():
  draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
  except FileNotFoundError:
  print(f"Error: Drafter not found at {DRAFT_MODEL}")
- print("Convert first: python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
  sys.exit(1)
 
  print("Creating DFlash decoder...")
@@ -411,18 +510,54 @@ if __name__ == "__main__":
 
  Run:
  ```bash
- python run_dflash.py
  ```
 
  ---
 
  ## 📚 Next Steps
 
- 1. **Convert your first drafter** → `python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
- 2. **Benchmark it** → Use `decoder.benchmark(...)` to verify speedup
- 3. **Start the server** → `python -m dflash_mlx.serve --target ... --draft ...`
- 4. **Connect your tools** → aider, Continue, custom clients
- 5. **Train custom drafters** → For unsupported models using `UniversalDFlashDecoder`
 
  ---
 
@@ -15,10 +15,86 @@
 
  ---
 
+ ## 1️⃣ Installation (Recommended: `uv`)
+
+ [`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.
+
+ ### Install `uv` (One-time)
+
+ ```bash
+ # Option A: Homebrew (macOS)
+ brew install uv
+
+ # Option B: Official installer
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+
+ # Verify
+ uv --version # Should show 0.6.x or higher
+ ```
+
+ ### Install DFlash-MLX-Universal with `uv`
+
+ ```bash
+ # 1. Clone the repo
+ git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
+ cd dflash-mlx-universal
+
+ # 2. Create virtual environment with uv (uses .python-version file)
+ uv venv
+
+ # 3. Install in editable mode with all dependencies
+ uv pip install -e ".[dev,server]"
+
+ # Or install directly from the repo
+ uv pip install "git+https://huggingface.co/tritesh/dflash-mlx-universal.git[dev,server]"
+ ```
+
+ ### Alternative: `uv` project workflow (no manual venv)
+
+ ```bash
+ # 1. Enter project directory
+ cd dflash-mlx-universal
+
+ # 2. uv automatically reads pyproject.toml and .python-version
+ uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"
+
+ # 3. Lock dependencies (creates uv.lock)
+ uv lock
+
+ # 4. Run any script with automatic dependency resolution
+ uv run python examples/qwen3_4b_demo.py
+
+ # 5. Run tests
+ uv run pytest tests/ -v
+
+ # 6. Start server
+ uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
+ ```
+
+ ### With `uv` and dependency groups
 
  ```bash
+ # Install only core dependencies
+ uv pip install -e .
+
+ # Install with server extras (FastAPI + uvicorn)
+ uv pip install -e ".[server]"
+
+ # Install with dev extras (pytest, black, ruff)
+ uv pip install -e ".[dev]"
+
+ # Install everything at once
+ uv pip install -e ".[dev,server]"
+ ```
+
+ ---
+
+ ## 1️⃣-alt Installation (Classic `pip`)
+
+ If you prefer `pip`:
+
+ ```bash
+ # 1. Create virtual environment
  python3 -m venv .venv-dflash
  source .venv-dflash/bin/activate # On zsh/bash
 
@@ -35,14 +111,6 @@ pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
  pip install fastapi uvicorn
  ```
 
  ---
 
  ## 2️⃣ Quick Start — Using a Pre-converted Drafter
 
@@ -52,20 +120,39 @@ pip install -e .
  Official drafters are PyTorch models. You need to convert them to MLX format once:
 
  ```bash
+ # With uv (recommended)
+ uv run python -m dflash_mlx.convert \
  --model z-lab/Qwen3-4B-DFlash-b16 \
  --output ~/models/dflash/Qwen3-4B-DFlash-mlx
 
+ # With classic pip
  python -m dflash_mlx.convert \
+ --model z-lab/Qwen3-4B-DFlash-b16 \
+ --output ~/models/dflash/Qwen3-4B-DFlash-mlx
+ ```
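
The converter's output layout isn't spelled out here, but a quick way to confirm a conversion finished is to look at the output directory for weight and config files. A minimal sketch, assuming the converted drafter is written as an MLX-style folder containing `config.json` and `*.safetensors` (adjust the path to your `--output`):

```python
# Sanity-check a converted drafter directory (layout assumed, see note above).
from pathlib import Path

out = Path.home() / "models/dflash/Qwen3-4B-DFlash-mlx"
weights = sorted(out.glob("*.safetensors"))

print("config.json present:", (out / "config.json").exists())
print("weight shards found:", [p.name for p in weights] or "none")
```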
 
+ **Supported drafters:**
+ ```bash
+ # Qwen3 series
+ uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
+ uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx
+
+ # Qwen3.5 series
+ uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
+ uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx
+
+ # Qwen3.6 series
+ uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
+ uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx
+
+ # LLaMA
+ uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
+
+ # Gemma
+ uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx
+
+ # GPT-OSS
+ uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
  ```
 
  **What this does:**
 
@@ -106,6 +193,11 @@ output = decoder.generate(
  print(output)
  ```
 
+ Run with `uv`:
+ ```bash
+ uv run python my_generate_script.py
+ ```
+
  **Expected output:**
  ```
  [DFlash] Prefill: processing 12 prompt tokens...
 
@@ -163,6 +255,11 @@ print(f"Speedup: {results['speedup']:.2f}x")
  print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
  ```
 
+ Run:
+ ```bash
+ uv run python benchmark_script.py
+ ```
+
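If you want to compare runs over time, the dict returned by `decoder.benchmark(...)` can be appended to a small log file. A minimal sketch, assuming the dict contains at least the `speedup` and `tokens_per_sec` keys printed above (other keys are not documented here):

```python
# Append one benchmark run to a JSON-lines log for later comparison.
import json
import time

def log_benchmark(results: dict, path: str = "dflash_benchmarks.jsonl") -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "speedup": results["speedup"],
        "tokens_per_sec": results["tokens_per_sec"],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```
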
  **Sample results (M2 Pro Max 96GB):**
  ```
  [Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
 
@@ -194,7 +291,7 @@ decoder = UniversalDFlashDecoder(
  block_size=16,
  )
 
+ # Option A: Train a custom drafter (2-8 hours on Apple Silicon)
  decoder.train_drafter(
  dataset="open-web-math", # or local JSONL
  epochs=6,
 
@@ -217,15 +314,15 @@ output = decoder.generate(
  Run a local server compatible with OpenAI clients:
 
  ```bash
+ # With uv (recommended)
+ uv run python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --block-size 16 \
  --port 8000
 
  # Or in background
+ nohup uv run python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --port 8000 > dflash.log 2>&1 &
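
# Quick smoke test once the server is up. The request shape follows the standard
# OpenAI chat-completions API this server advertises compatibility with; the
# model name is an assumption, adjust it to whatever your server expects.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-4b", "messages": [{"role": "user", "content": "Say hello"}]}'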
 
@@ -284,7 +381,9 @@ Any OpenAI-compatible client works:
 
  ### aider (AI coding assistant)
  ```bash
+ aider --model openai/qwen3-4b \
+ --openai-api-base http://localhost:8000/v1 \
+ --openai-api-key not-needed
  ```
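
Any OpenAI-compatible SDK can talk to the same endpoint directly. A minimal sketch with the official `openai` Python package; the base URL and placeholder key mirror the aider flags above, while the model name is an assumption and may need to match whatever your server exposes:

```python
# Point the standard OpenAI client at the local DFlash server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-4b",  # assumed name; use the id your server exposes
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
)
print(response.choices[0].message.content)
```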
 
  ### Continue.dev (VS Code extension)
 
@@ -381,7 +480,7 @@ def main():
  draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
  except FileNotFoundError:
  print(f"Error: Drafter not found at {DRAFT_MODEL}")
+ print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
  sys.exit(1)
 
  print("Creating DFlash decoder...")
 
@@ -411,18 +510,54 @@ if __name__ == "__main__":
 
  Run:
  ```bash
+ uv run python run_dflash.py
+ ```
+
+ ---
+
+ ## 🔄 Daily Workflow with `uv`
+
+ ```bash
+ # cd into your project
+ cd ~/projects/dflash-mlx-universal
+
+ # Run any script — uv handles the virtual env automatically
+ uv run python examples/qwen3_4b_demo.py
+
+ # Run the server
+ uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000
+
+ # Run tests
+ uv run pytest tests/ -v
+
+ # Format code
+ uv run black dflash_mlx/
+
+ # Lint
+ uv run ruff check dflash_mlx/
+
+ # Add a dependency (quote the specifier so the shell doesn't treat > as a redirect)
+ uv add "numpy>=1.26.0"
+
+ # Lock dependencies
+ uv lock
+
+ # Sync environment with lock file
+ uv sync
  ```
 
  ---
 
  ## 📚 Next Steps
 
+ 1. **Install `uv`** → `brew install uv`
+ 2. **Clone repo** → `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
+ 3. **Install** → `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
+ 4. **Convert drafter** → `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
+ 5. **Benchmark** → `uv run python examples/qwen3_4b_demo.py`
+ 6. **Start server** → `uv run python -m dflash_mlx.serve --target ... --draft ...`
+ 7. **Connect tools** → aider, Continue, custom clients
+ 8. **Train custom drafters** → For unsupported models using `UniversalDFlashDecoder`
 
  ---