# DFlash-MLX-Universal: System Usage Guide

> How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4)

---
## Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |
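
A quick way to confirm your machine meets these requirements before installing anything. This is a minimal check script written for this guide, not part of the package:

```python
# Environment check: Apple Silicon, Python version, and (optionally) MLX availability.
import platform
import sys

assert platform.machine() == "arm64", "Apple Silicon (arm64) is required"
assert sys.version_info >= (3, 9), "Python 3.9+ is required"

try:
    import mlx.core as mx
    print("MLX OK, default device:", mx.default_device())
except ImportError:
    print("MLX not installed yet; see the installation steps below.")
```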

---

## 1️⃣ Installation (Recommended: `uv`)

[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.

### Install `uv` (One-time)

```bash
# Option A: Homebrew (macOS)
brew install uv

# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify
uv --version  # Should show 0.6.x or higher
```
### Install DFlash-MLX-Universal with `uv`

```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal

# 2. Create virtual environment with uv (uses .python-version file)
uv venv

# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"

# Or install directly from the repo
uv pip install "dflash-mlx-universal[dev,server] @ git+https://huggingface.co/tritesh/dflash-mlx-universal.git"
```
### Alternative: `uv` project workflow (no manual venv)

```bash
# 1. Enter project directory
cd dflash-mlx-universal

# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"

# 3. Lock dependencies (creates uv.lock)
uv lock

# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py

# 5. Run tests
uv run pytest tests/ -v

# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```
### With `uv` and dependency groups

```bash
# Install only core dependencies
uv pip install -e .

# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"

# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"

# Install everything at once
uv pip install -e ".[dev,server]"
```

---
## 1️⃣-alt Installation (Classic `pip`)

If you prefer `pip`:

```bash
# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate  # On zsh/bash

# 2. Upgrade pip
pip install --upgrade pip

# 3. Install core dependencies (quote the specifiers so the shell doesn't treat ">" as a redirect)
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"

# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git

# Optional: server mode
pip install fastapi uvicorn
```

---
## 2️⃣ Quick Start – Using a Pre-converted Drafter

### Step A: Convert an Official DFlash Drafter to MLX

Official drafters ship as PyTorch checkpoints, so you need to convert them to MLX format once:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx
```
**Supported drafters:**
```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx

# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx

# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx

# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx

# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx

# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```
**What this does:**
- Downloads PyTorch weights from HF Hub
- Transposes linear layers (PyTorch → MLX format)
- Saves as `weights.npz` + `config.json`
- Creates `model_info.json` with target model mapping
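
For intuition, here is a minimal sketch of that pipeline in plain PyTorch/NumPy. It is not the package's actual converter (`dflash_mlx.convert` also handles sharded checkpoints, dtypes, and the `model_info.json` mapping); the file names and the transpose step are taken from the bullets above, and `convert_drafter` is a hypothetical helper written for this guide:

```python
# Illustrative only: mirrors the conversion steps listed above.
import json
import os

import numpy as np
import torch


def convert_drafter(checkpoint_path: str, config: dict, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    state = torch.load(checkpoint_path, map_location="cpu")  # PyTorch drafter weights
    arrays = {}
    for name, tensor in state.items():
        array = tensor.to(torch.float32).numpy()
        # Transpose 2-D linear weights into the layout MLX expects (per the bullet above).
        if array.ndim == 2 and name.endswith(".weight"):
            array = array.T
        arrays[name] = array
    np.savez(os.path.join(output_dir, "weights.npz"), **arrays)      # -> weights.npz
    with open(os.path.join(output_dir, "config.json"), "w") as f:    # -> config.json
        json.dump(config, f, indent=2)
```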

---

### Step B: Generate with DFlash Speculative Decoding

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")

# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)

# 4. Generate with 6× speedup
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)
```

Run with `uv`:
```bash
uv run python my_generate_script.py
```
**Expected output:**
```
[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
```
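
As a rough sanity check on those numbers: with block verification, each target forward pass commits `avg acceptance` tokens instead of one, minus the drafting overhead. A back-of-envelope estimate, where the 7% overhead figure is an assumption rather than a measured value:

```python
# Rough relationship between acceptance length and effective speedup.
avg_acceptance = 6.23   # tokens committed per target verification pass (from the log above)
draft_overhead = 0.07   # assumed relative cost of drafting one block vs. one target pass
estimated_speedup = avg_acceptance / (1 + draft_overhead)
print(f"~{estimated_speedup:.1f}x")  # about 5.8x, in line with the log above
```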

---

## 3️⃣ Streaming Generation

For real-time output (chat UI, etc.):

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # ← Returns a generator
):
    print(chunk, end="", flush=True)
```

---

## 4️⃣ Benchmark Mode

Compare DFlash vs. baseline speed:

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)

print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```

Run:
```bash
uv run python benchmark_script.py
```
**Sample results (M2 Max, 96GB):**
```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```

---
## 5️⃣ Universal Decoder (Any Model Without a Pre-built Drafter)

If your model doesn't have a DFlash drafter yet:

```python
from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder

# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up the proper adapter for hidden-state extraction

decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or local JSONL
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)

# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)
```

---
## 6️⃣ OpenAI-Compatible Server

Run a local server compatible with OpenAI clients:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --block-size 16 \
    --port 8000

# Or in background
nohup uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --port 8000 > dflash.log 2>&1 &
```

### Query the server

```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.0
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "max_tokens": 100,
    "stream": true
  }'

# Check metrics
curl http://localhost:8000/metrics
```
### Python client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

---
## 7️⃣ Using with Ollama, aider, Continue, etc.

Any OpenAI-compatible client works:

### aider (AI coding assistant)
```bash
aider --model openai/qwen3-4b \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key not-needed
```

### Continue.dev (VS Code extension)
```json
// .continue/config.json
{
  "models": [{
    "title": "DFlash Qwen3-4B",
    "provider": "openai",
    "model": "qwen3-4b",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}
```

### Ollama (as custom endpoint)
Configure any OpenAI-compatible frontend to point at `http://localhost:8000/v1`.

---

## 8️⃣ Supported Model Families

| Family | Target Model Example | Drafter Status |
|--------|---------------------|----------------|
| **Qwen3** | `mlx-community/Qwen3-4B-bf16` | ✅ Pre-built |
| **Qwen3.5** | `mlx-community/Qwen3.5-9B-4bit` | ✅ Pre-built |
| **Qwen3.6** | `mlx-community/Qwen3.6-27B-4bit` | ✅ Pre-built |
| **LLaMA 3.1** | `mlx-community/Llama-3.1-8B-Instruct-4bit` | ✅ Pre-built |
| **LLaMA 3.3** | `mlx-community/Llama-3.3-70B-Instruct-4bit` | ✅ Pre-built |
| **Mistral** | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | ⚠️ Train custom |
| **Gemma** | `mlx-community/gemma-4-31b-it-4bit` | ✅ Pre-built |
| **Phi** | `mlx-community/Phi-3-mini-4k-instruct-4bit` | ⚠️ Generic adapter |

---
## 9️⃣ Troubleshooting

### "Unsupported model_type: phi"
```python
# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS


class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...


ADAPTERS["phi"] = PhiAdapter
```

### "Vocab size mismatch"
Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
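
A quick way to compare the two. This sketch assumes the converted drafter's config exposes a `vocab_size` field; adjust the key (and the tokenizer attribute) to whatever your files actually contain:

```python
# Compare target tokenizer vocab with the drafter's configured vocab.
from mlx_lm import load
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
_, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

print("target vocab:", tokenizer.vocab_size)
print("draft vocab: ", draft_config.get("vocab_size"))  # key name assumed
```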

### Slow first run
MLX compiles Metal kernels lazily, so the first generation is slow; subsequent runs are fast. The `benchmark` method includes a warmup run.
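
If you time generations yourself, a throwaway call serves the same purpose (this assumes a `decoder` constructed as in the examples above):

```python
# Force Metal kernel compilation before taking measurements.
_ = decoder.generate(prompt="warmup", max_tokens=8)
```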

### Out of memory
- Reduce `--block-size` (default 16 → 8)
- Use 4-bit quantized target models (`-4bit` suffix)
- Reduce `max_tokens`
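
For example, the first two suggestions combined; the 4-bit model name below is illustrative, so substitute any `-4bit` MLX conversion of your target:

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# Smaller block size plus a 4-bit quantized target model.
model, tokenizer = load("mlx-community/Qwen3-4B-4bit")  # illustrative -4bit model id
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=8)
```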

### Draft tokens all rejected
- Drafter may not match target model (wrong family)
- Use trained drafter for your specific model
- Check `target_layer_ids` alignment in config
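
To see what the converted drafter expects, inspect its `config.json` directly. The field names below come from the conversion output described in Step A; check your file for the exact keys:

```python
import json
from pathlib import Path

cfg_path = Path("~/models/dflash/Qwen3-4B-DFlash-mlx/config.json").expanduser()
cfg = json.loads(cfg_path.read_text())
print("block_size:      ", cfg.get("block_size"))
print("target_layer_ids:", cfg.get("target_layer_ids"))
```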

---

## Full Example Script

Save as `run_dflash.py`:

```python
#!/usr/bin/env python3
"""Complete DFlash example with error handling."""

import sys

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash


def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512

    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)

    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )

    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)

    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )

    print(output)
    print("-" * 60)
    print("Done!")


if __name__ == "__main__":
    main()
```

Run:
```bash
uv run python run_dflash.py
```

---

## Daily Workflow with `uv`

```bash
# cd into your project
cd ~/projects/dflash-mlx-universal

# Run any script – uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py

# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000

# Run tests
uv run pytest tests/ -v

# Format code
uv run black dflash_mlx/

# Lint
uv run ruff check dflash_mlx/

# Add a dependency (quote the specifier so the shell doesn't treat ">" as a redirect)
uv add "numpy>=1.26.0"

# Lock dependencies
uv lock

# Sync environment with lock file
uv sync
```

---

## Next Steps

1. **Install `uv`** → `brew install uv`
2. **Clone repo** → `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** → `cd dflash-mlx-universal && uv venv && uv pip install -e ".[dev,server]"`
4. **Convert drafter** → `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** → `uv run python examples/qwen3_4b_demo.py`
6. **Start server** → `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** → aider, Continue, custom clients
8. **Train custom drafters** → for unsupported models, use `UniversalDFlashDecoder`

---

For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions