# DFlash-MLX-Universal: System Usage Guide

> How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4)

---

## 📋 Prerequisites

| Requirement | Version | Notes |
|------------|---------|-------|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |

---

## 1️⃣ Installation (Recommended: `uv`)

[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.

### Install `uv` (One-time)

```bash
# Option A: Homebrew (macOS)
brew install uv

# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify
uv --version  # Should show 0.6.x or higher
```

### Install DFlash-MLX-Universal with `uv`

```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal

# 2. Create virtual environment with uv (uses .python-version file)
uv venv

# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"

# Or install directly from the repo
uv pip install "dflash-mlx-universal[dev,server] @ git+https://huggingface.co/tritesh/dflash-mlx-universal.git"
```

### Alternative: `uv` project workflow (no manual venv)

```bash
# 1. Enter project directory
cd dflash-mlx-universal

# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"

# 3. Lock dependencies (creates uv.lock)
uv lock

# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py

# 5. Run tests
uv run pytest tests/ -v

# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```

### With `uv` and dependency groups

```bash
# Install only core dependencies
uv pip install -e .

# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"

# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"

# Install everything at once
uv pip install -e ".[dev,server]"
```

---

## 1️⃣-alt Installation (Classic `pip`)

If you prefer `pip`:

```bash
# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate  # On zsh/bash

# 2. Upgrade pip
pip install --upgrade pip

# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"

# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git

# Optional: server mode
pip install fastapi uvicorn
```
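Whichever install path you used, a quick import check confirms the package is on your path. This mirrors the `uv run` check above and assumes the package exposes the `dflash_mlx` module with a `__version__` attribute:

```bash
# Run inside the activated virtual environment (or prefix with `uv run`)
python -c "import dflash_mlx; print(dflash_mlx.__version__)"
```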
---

## 2️⃣ Quick Start — Using a Pre-converted Drafter

### Step A: Convert an Official DFlash Drafter to MLX

Official drafters are PyTorch models. You need to convert them to MLX format once:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx
```

**Supported drafters:**

```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx

# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx

# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx

# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx

# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx

# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```

**What this does:**

- Downloads PyTorch weights from HF Hub
- Transposes linear layers (PyTorch → MLX format)
- Saves as `weights.npz` + `config.json`
- Creates `model_info.json` with target model mapping

---

### Step B: Generate with DFlash Speculative Decoding

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")

# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)

# 4. Generate with up to ~6× speedup
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)
```

Run with `uv`:

```bash
uv run python my_generate_script.py
```

**Expected output:**

```
[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
```
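The last two numbers are related: with greedy verification, each target forward pass accepts ~6.2 drafted tokens on average, and the end-to-end speedup comes out slightly lower because the drafter itself costs some compute. A rough back-of-the-envelope model (the overhead fraction below is an assumption for illustration, not a figure reported by the library):

```python
# Rough model: one target verification pass yields `avg_acceptance` tokens,
# while drafting adds a small extra cost per block relative to a target pass.
avg_acceptance = 6.23   # accepted tokens per verification step (from the log above)
draft_overhead = 0.07   # assumed drafter cost as a fraction of a target pass

effective_speedup = avg_acceptance / (1 + draft_overhead)
print(f"~{effective_speedup:.1f}x")  # ≈ 5.8x, in line with the reported figure
```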
---

## 3️⃣ Streaming Generation

For real-time output (chat UI, etc.):

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # ← Returns generator
):
    print(chunk, end="", flush=True)
```
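For chat-style prompts, you will usually want to wrap the user message in the model's chat template before streaming. A minimal sketch, assuming the tokenizer returned by `mlx_lm.load` exposes the standard `apply_chat_template` method (Qwen3 instruct tokenizers do):

```python
# Build a chat-formatted prompt, then stream the reply as it is generated.
messages = [{"role": "user", "content": "Tell me a story about a robot."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

for chunk in decoder.generate(prompt=prompt, max_tokens=512, stream=True):
    print(chunk, end="", flush=True)
```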
---

## 4️⃣ Benchmark Mode

Compare DFlash vs baseline speed:

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)

print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```

Run:

```bash
uv run python benchmark_script.py
```

**Sample results (M2 Max, 96GB):**

```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```

---

## 5️⃣ Universal Decoder (Any Model Without Pre-built Drafter)

If your model doesn't have a DFlash drafter yet:

```python
from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder

# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up proper adapter for hidden state extraction
decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or local JSONL
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)

# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)
```

---

## 6️⃣ OpenAI-Compatible Server

Run a local server compatible with OpenAI clients:

```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --block-size 16 \
    --port 8000

# Or in background
nohup uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --port 8000 > dflash.log 2>&1 &
```

### Query the server

```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.0
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "max_tokens": 100,
    "stream": true
  }'

# Check metrics
curl http://localhost:8000/metrics
```

### Python client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

---

## 7️⃣ Using with Ollama, aider, Continue, etc.

Any OpenAI-compatible client works:

### aider (AI coding assistant)

```bash
aider --model openai/qwen3-4b \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key not-needed
```

### Continue.dev (VS Code extension)

```json
// .continue/config.json
{
  "models": [{
    "title": "DFlash Qwen3-4B",
    "provider": "openai",
    "model": "qwen3-4b",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}
```

### Ollama (as custom endpoint)

Configure any OpenAI-compatible frontend to point at `http://localhost:8000/v1`.

---

## 8️⃣ Supported Model Families

| Family | Target Model Example | Drafter Status |
|--------|---------------------|---------------|
| **Qwen3** | `mlx-community/Qwen3-4B-bf16` | ✅ Pre-built |
| **Qwen3.5** | `mlx-community/Qwen3.5-9B-4bit` | ✅ Pre-built |
| **Qwen3.6** | `mlx-community/Qwen3.6-27B-4bit` | ✅ Pre-built |
| **LLaMA 3.1** | `mlx-community/Llama-3.1-8B-Instruct-4bit` | ✅ Pre-built |
| **LLaMA 3.3** | `mlx-community/Llama-3.3-70B-Instruct-4bit` | ✅ Pre-built |
| **Mistral** | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | ⚠️ Train custom |
| **Gemma** | `mlx-community/gemma-4-31b-it-4bit` | ✅ Pre-built |
| **Phi** | `mlx-community/Phi-3-mini-4k-instruct-4bit` | ⚠️ Generic adapter |

---

## 9️⃣ Troubleshooting

### "Unsupported model_type: phi"

```python
# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS

class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...

ADAPTERS["phi"] = PhiAdapter
```

### "Vocab size mismatch"

Ensure the target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
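A quick way to sanity-check this (a sketch only: it assumes the converted drafter's config records a `vocab_size` field, so adjust the key if your drafter names it differently):

```python
# Compare the drafter's recorded vocab against the target tokenizer's vocabulary.
draft_vocab = draft_config.get("vocab_size")  # assumed key, see note above
target_vocab = tokenizer.vocab_size

if draft_vocab is not None and draft_vocab != target_vocab:
    print(f"Vocab mismatch: drafter={draft_vocab}, target={target_vocab}")
```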
### Slow first run

MLX compiles Metal kernels lazily. The first generation is slow; subsequent runs are fast. The benchmark method includes a warmup.

### Out of memory

- Reduce `--block-size` (default 16 → 8)
- Use 4-bit quantized target models (`-4bit` suffix)
- Reduce `max_tokens`

### Draft tokens all rejected

- Drafter may not match the target model (wrong family)
- Use a trained drafter for your specific model
- Check `target_layer_ids` alignment in the config

---

## 🔟 Full Example Script

Save as `run_dflash.py`:

```python
#!/usr/bin/env python3
"""Complete DFlash example with error handling."""

import sys

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash


def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512

    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)

    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )

    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)

    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )

    print(output)
    print("-" * 60)
    print("Done!")


if __name__ == "__main__":
    main()
```

Run:

```bash
uv run python run_dflash.py
```

---

## 🔄 Daily Workflow with `uv`

```bash
# cd into your project
cd ~/projects/dflash-mlx-universal

# Run any script — uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py

# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000

# Run tests
uv run pytest tests/ -v

# Format code
uv run black dflash_mlx/

# Lint
uv run ruff check dflash_mlx/

# Add a dependency
uv add "numpy>=1.26.0"

# Lock dependencies
uv lock

# Sync environment with lock file
uv sync
```

---

## 📚 Next Steps

1. **Install `uv`** → `brew install uv`
2. **Clone repo** → `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** → `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
4. **Convert drafter** → `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** → `uv run python examples/qwen3_4b_demo.py`
6. **Start server** → `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** → aider, Continue, custom clients
8. **Train custom drafters** → For unsupported models using `UniversalDFlashDecoder`

---

For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions