---

## 1️⃣ Installation (Recommended: `uv`)

[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.

### Install `uv` (One-time)

```bash
# Option A: Homebrew (macOS)
brew install uv

# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify
uv --version  # Should show 0.6.x or higher
```

### Install DFlash-MLX-Universal with `uv`

```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal

# 2. Create a virtual environment with uv (uses the .python-version file)
uv venv

# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"

# Or install directly from the repo
uv pip install "git+https://huggingface.co/tritesh/dflash-mlx-universal.git[dev,server]"
```

### Alternative: `uv` project workflow (no manual venv)

```bash
# 1. Enter the project directory
cd dflash-mlx-universal

# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"

# 3. Lock dependencies (creates uv.lock)
uv lock

# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py

# 5. Run the tests
uv run pytest tests/ -v

# 6. Start the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```

### With `uv` and dependency groups

```bash
# Install only the core dependencies
uv pip install -e .

# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"

# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"

# Install everything at once
uv pip install -e ".[dev,server]"
```

---

## 1️⃣-alt Installation (Classic `pip`)

If you prefer `pip`:

```bash
# 1. Create a virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate  # On zsh/bash

# 2. Install from the repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git

# 3. Server extras (FastAPI + uvicorn)
pip install fastapi uvicorn
```

---

## 2️⃣ Quick Start – Using a Pre-converted Drafter

Official drafters are PyTorch models. You need to convert them to MLX format once:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
  --model z-lab/Qwen3-4B-DFlash-b16 \
  --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
  --model z-lab/Qwen3-4B-DFlash-b16 \
  --output ~/models/dflash/Qwen3-4B-DFlash-mlx
```

**Supported drafters:**

```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx

# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx

# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx

# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx

# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx

# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```

**What this does:**

```python
print(output)
```

Run with `uv`:

```bash
uv run python my_generate_script.py
```

**Expected output:**

```
[DFlash] Prefill: processing 12 prompt tokens...
```
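The block-wise numbers in logs like this come from draft-and-verify decoding: the drafter proposes a whole block of tokens, and the target model checks it in a single pass. A toy sketch of greedy verification (illustrative only, not the library's actual code):

```python
def verify_block(draft, target):
    """Toy greedy speculative verification: keep draft tokens while they
    match the target model's own greedy picks; at the first mismatch emit
    the target's token instead, so each verify pass yields >= 1 token."""
    out = []
    for d, t in zip(draft, target):
        if d == t:
            out.append(d)
        else:
            out.append(t)  # the target's correction ends the block
            break
    return out

# A 4-token drafted block vs. the target's greedy choices
print(verify_block([5, 9, 2, 7], [5, 9, 4, 7]))  # → [5, 9, 4]
```

Accepting long prefixes is what makes a good drafter pay off: the more of each block survives verification, the fewer target-model passes per generated token.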

```python
print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```

Run:

```bash
uv run python benchmark_script.py
```

**Sample results (M2 Max, 96 GB):**

```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```
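The speedup figure in the benchmark line is simply the ratio of the two wall-clock times:

```python
# Timings taken from the sample benchmark line above
baseline_s = 2.34  # plain autoregressive decoding
dflash_s = 0.41    # DFlash speculative decoding

print(f"Speedup: {baseline_s / dflash_s:.2f}x")  # → Speedup: 5.71x
```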

```python
decoder = UniversalDFlashDecoder(
    ...
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or a local JSONL file
    epochs=6,
    ...
)
```
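`train_drafter` above also accepts a local JSONL file in place of a dataset name. A minimal sketch of writing one; note the one-JSON-object-per-line layout. The `text` field name is an assumption here, so check the repo's training docs for the expected schema:

```python
import json

samples = [
    {"text": "Prove that the sum of two even integers is even."},
    {"text": "Solve 3x + 5 = 20 for x."},
]

# JSONL = one JSON object per line
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

print(sum(1 for _ in open("my_dataset.jsonl")))  # → 2
```

You would then pass `dataset="my_dataset.jsonl"` instead of `"open-web-math"`.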

Run a local server compatible with OpenAI clients:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --block-size 16 \
  --port 8000

# Or in the background
nohup uv run python -m dflash_mlx.serve \
  --target mlx-community/Qwen3-4B-bf16 \
  --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
  --port 8000 > dflash.log 2>&1 &
```
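Once the server is up, any HTTP client can talk to the OpenAI-style endpoint. A stdlib-only sketch that builds (without sending) a chat-completion request; the model name `qwen3-4b` and the dummy API key mirror the client examples in this guide:

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, messages: list) -> request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": 128}
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # assumed: local server ignores the key
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",
    "qwen3-4b",
    [{"role": "user", "content": "Say hello in five words."}],
)
print(req.full_url)  # → http://localhost:8000/v1/chat/completions
# To actually send it: print(request.urlopen(req).read().decode())
```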

Any OpenAI-compatible client works:

### aider (AI coding assistant)

```bash
aider --model openai/qwen3-4b \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key not-needed
```

### Continue.dev (VS Code extension)
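For Continue, point a model entry at the local server. A sketch assuming Continue's classic `config.json` format (field names vary across Continue versions, so treat it as a starting point):

```json
{
  "models": [
    {
      "title": "Qwen3-4B (DFlash local)",
      "provider": "openai",
      "model": "qwen3-4b",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}
```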

```python
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
```

Run:

```bash
uv run python run_dflash.py
```

---

## Daily Workflow with `uv`

```bash
# cd into your project
cd ~/projects/dflash-mlx-universal

# Run any script - uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py

# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000

# Run tests
uv run pytest tests/ -v

# Format code
uv run black dflash_mlx/

# Lint
uv run ruff check dflash_mlx/

# Add a dependency (quote it so the shell doesn't treat >= as a redirect)
uv add "numpy>=1.26.0"

# Lock dependencies
uv lock

# Sync environment with lock file
uv sync
```

---

## Next Steps

1. **Install `uv`** – `brew install uv`
2. **Clone repo** – `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** – `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
4. **Convert drafter** – `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** – `uv run python examples/qwen3_4b_demo.py`
6. **Start server** – `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** – aider, Continue, custom clients
8. **Train custom drafters** – for unsupported models, using `UniversalDFlashDecoder`

---