tritesh
/

dflash-mlx-universal

+# DFlash-MLX-Universal: System Usage Guide
+> How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4)
+---
+## 📋 Prerequisites
+| Requirement | Version | Notes |
+|------------|---------|-------|
+| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
+| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
+| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
+| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |
+---
+## 1️⃣ Installation
+```bash
+# 1. Create a virtual environment (recommended)
+python3 -m venv .venv-dflash
+source .venv-dflash/bin/activate  # On zsh/bash
+# 2. Upgrade pip
+pip install --upgrade pip
+# 3. Install core dependencies
+pip install mlx-lm>=0.24.0 transformers>=4.57.0 huggingface-hub>=0.25.0
+# 4. Install dflash-mlx-universal from your repo
+pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
+# Optional: server mode
+pip install fastapi uvicorn
+```
+### Alternative: Install from local clone
+```bash
+git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
+cd dflash-mlx-universal
+pip install -e .
+```
+---
+## 2️⃣ Quick Start — Using a Pre-converted Drafter
+### Step A: Convert an Official DFlash Drafter to MLX
+Official drafters are PyTorch models. You need to convert them to MLX format once:
+```bash
+# Convert Qwen3-4B drafter (~2-4 minutes on M2 Pro Max)
+python -m dflash_mlx.convert \
+    --model z-lab/Qwen3-4B-DFlash-b16 \
+    --output ~/models/dflash/Qwen3-4B-DFlash-mlx
+# Convert Qwen3.5-9B drafter
+python -m dflash_mlx.convert \
+    --model z-lab/Qwen3.5-9B-DFlash \
+    --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
+# Convert LLaMA-3.1-8B drafter
+python -m dflash_mlx.convert \
+    --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
+    --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
+```
+**What this does:**
+- Downloads PyTorch weights from HF Hub
+- Transposes linear layers (PyTorch → MLX format)
+- Saves as `weights.npz` + `config.json`
+- Creates `model_info.json` with target model mapping
+---
+### Step B: Generate with DFlash Speculative Decoding
+```python
+from mlx_lm import load
+from dflash_mlx import DFlashSpeculativeDecoder
+from dflash_mlx.convert import load_mlx_dflash
+# 1. Load target model (any MLX-converted model)
+model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
+# 2. Load converted DFlash drafter
+draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
+# 3. Create decoder (auto-detects architecture via adapters)
+decoder = DFlashSpeculativeDecoder(
+    target_model=model,
+    draft_model=draft_model,
+    tokenizer=tokenizer,
+    block_size=draft_config.get("block_size", 16),
+)
+# 4. Generate with 6× speedup
+output = decoder.generate(
+    prompt="Write a Python function to implement quicksort.",
+    max_tokens=1024,
+    temperature=0.0,  # Greedy for exact reproduction
+)
+print(output)
+```
+**Expected output:**
+```
+[DFlash] Prefill: processing 12 prompt tokens...
+[DFlash] Starting speculative decoding (block_size=16)...
+[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
+```
+---
+## 3️⃣ Streaming Generation
+For real-time output (chat UI, etc.):
+```python
+from mlx_lm import load
+from dflash_mlx import DFlashSpeculativeDecoder
+from dflash_mlx.convert import load_mlx_dflash
+model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
+draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
+decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
+# Generator-based streaming
+for chunk in decoder.generate(
+    prompt="Tell me a story about a robot.",
+    max_tokens=512,
+    stream=True,  # ← Returns generator
+):
+    print(chunk, end="", flush=True)
+```
+---
+## 4️⃣ Benchmark Mode
+Compare DFlash vs baseline speed:
+```python
+from mlx_lm import load
+from dflash_mlx import DFlashSpeculativeDecoder
+from dflash_mlx.convert import load_mlx_dflash
+model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
+draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
+decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
+# Run benchmark
+results = decoder.benchmark(
+    prompt="Write a quicksort in Python.",
+    max_tokens=512,
+    num_runs=5,
+)
+print(f"Speedup: {results['speedup']:.2f}x")
+print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
+```
+**Sample results (M2 Pro Max 96GB):**
+```
+[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
+```
+---
+## 5️⃣ Universal Decoder (Any Model Without Pre-built Drafter)
+If your model doesn't have a DFlash drafter yet:
+```python
+from mlx_lm import load
+from dflash_mlx.universal import UniversalDFlashDecoder
+# Load ANY mlx_lm model
+model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
+# UniversalDFlashDecoder:
+# 1. Auto-detects architecture (LLaMA in this case)
+# 2. Creates a generic 5-layer drafter (~500MB)
+# 3. Sets up proper adapter for hidden state extraction
+decoder = UniversalDFlashDecoder(
+    target_model=model,
+    tokenizer=tokenizer,
+    draft_layers=5,
+    draft_hidden_size=1024,
+    block_size=16,
+)
+# Option A: Train a custom drafter (2-8 hours)
+decoder.train_drafter(
+    dataset="open-web-math",  # or local JSONL
+    epochs=6,
+    lr=6e-4,
+    batch_size=16,
+    output_path="~/models/dflash/my-llama-drafter",
+)
+# Option B: Use untrained (low quality, for testing only)
+output = decoder.generate(
+    prompt="Hello world!",
+    max_tokens=100,
+)
+```
+---
+## 6️⃣ OpenAI-Compatible Server
+Run a local server compatible with OpenAI clients:
+```bash
+# Start server
+python -m dflash_mlx.serve \
+    --target mlx-community/Qwen3-4B-bf16 \
+    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
+    --block-size 16 \
+    --port 8000
+# Or in background
+nohup python -m dflash_mlx.serve \
+    --target mlx-community/Qwen3-4B-bf16 \
+    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
+    --port 8000 > dflash.log 2>&1 &
+```
+### Query the server
+```bash
+# Chat completion
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen3-4b",
+    "messages": [{"role": "user", "content": "Explain quantum computing"}],
+    "max_tokens": 512,
+    "temperature": 0.0
+  }'
+# Streaming
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen3-4b",
+    "messages": [{"role": "user", "content": "Count to 10"}],
+    "max_tokens": 100,
+    "stream": true
+  }'
+# Check metrics
+curl http://localhost:8000/metrics
+```
+### Python client
+```python
+from openai import OpenAI
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="not-needed",  # Local server, no auth
+)
+response = client.chat.completions.create(
+    model="qwen3-4b",
+    messages=[{"role": "user", "content": "Write a haiku about ML"}],
+    max_tokens=100,
+)
+print(response.choices[0].message.content)
+```
+---
+## 7️⃣ Using with Ollama, aider, Continue, etc.
+Any OpenAI-compatible client works:
+### aider (AI coding assistant)
+```bash
+aider --model openai/qwen3-4b --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed
+```
+### Continue.dev (VS Code extension)
+```json
+// .continue/config.json
+{
+  "models": [{
+    "title": "DFlash Qwen3-4B",
+    "provider": "openai",
+    "model": "qwen3-4b",
+    "apiBase": "http://localhost:8000/v1",
+    "apiKey": "not-needed"
+  }]
+}
+```
+### Ollama (as custom endpoint)
+Configure any OpenAI-compatible frontend to point at `http://localhost:8000/v1`
+---
+## 8️⃣ Supported Model Families
+| Family | Target Model Example | Drafter Status |
+|--------|---------------------|---------------|
+| **Qwen3** | `mlx-community/Qwen3-4B-bf16` | ✅ Pre-built |
+| **Qwen3.5** | `mlx-community/Qwen3.5-9B-4bit` | ✅ Pre-built |
+| **Qwen3.6** | `mlx-community/Qwen3.6-27B-4bit` | ✅ Pre-built |
+| **LLaMA 3.1** | `mlx-community/Llama-3.1-8B-Instruct-4bit` | ✅ Pre-built |
+| **LLaMA 3.3** | `mlx-community/Llama-3.3-70B-Instruct-4bit` | ✅ Pre-built |
+| **Mistral** | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | ⚠️ Train custom |
+| **Gemma** | `mlx-community/gemma-4-31b-it-4bit` | ✅ Pre-built |
+| **Phi** | `mlx-community/Phi-3-mini-4k-instruct-4bit` | ⚠️ Generic adapter |
+---
+## 9️⃣ Troubleshooting
+### "Unsupported model_type: phi"
+```python
+# Add a custom adapter for your model
+from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS
+class PhiAdapter(MLXTargetAdapter):
+    family = "phi"
+    # Override methods as needed...
+ADAPTERS["phi"] = PhiAdapter
+```
+### "Vocab size mismatch"
+Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
+### Slow first run
+MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
+### Out of memory
+- Reduce `--block-size` (default 16 → 8)
+- Use 4-bit quantized target models (`-4bit` suffix)
+- Reduce `max_tokens`
+### Draft tokens all rejected
+- Drafter may not match target model (wrong family)
+- Use trained drafter for your specific model
+- Check `target_layer_ids` alignment in config
+---
+## 🔟 Full Example Script
+Save as `run_dflash.py`:
+```python
+#!/usr/bin/env python3
+"""Complete DFlash example with error handling."""
+import sys
+from mlx_lm import load
+from dflash_mlx import DFlashSpeculativeDecoder
+from dflash_mlx.convert import load_mlx_dflash
+def main():
+    # Configuration
+    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
+    DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
+    PROMPT = "Explain how speculative decoding works."
+    MAX_TOKENS = 512
+    print(f"Loading target model: {TARGET_MODEL}")
+    model, tokenizer = load(TARGET_MODEL)
+    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
+    try:
+        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
+    except FileNotFoundError:
+        print(f"Error: Drafter not found at {DRAFT_MODEL}")
+        print("Convert first: python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
+        sys.exit(1)
+    print("Creating DFlash decoder...")
+    decoder = DFlashSpeculativeDecoder(
+        target_model=model,
+        draft_model=draft_model,
+        tokenizer=tokenizer,
+        block_size=draft_config.get("block_size", 16),
+    )
+    print(f"\nPrompt: {PROMPT}")
+    print("-" * 60)
+    output = decoder.generate(
+        prompt=PROMPT,
+        max_tokens=MAX_TOKENS,
+        temperature=0.0,
+    )
+    print(output)
+    print("-" * 60)
+    print("Done!")
+if __name__ == "__main__":
+    main()
+```
+Run:
+```bash
+python run_dflash.py
+```
+---
+## 📚 Next Steps
+1. **Convert your first drafter** → `python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
+2. **Benchmark it** → Use `decoder.benchmark(...)` to verify speedup
+3. **Start the server** → `python -m dflash_mlx.serve --target ... --draft ...`
+4. **Connect your tools** → aider, Continue, custom clients
+5. **Train custom drafters** → For unsupported models using `UniversalDFlashDecoder`
+---
+For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions