# DFlash-MLX-Universal: System Usage Guide
> How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4)
---
## Prerequisites
| Requirement | Version | Notes |
|------------|---------|-------|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |
---
## 1. Installation (Recommended: `uv`)
[`uv`](https://github.com/astral-sh/uv) is an extremely fast Python package manager written in Rust. It's the recommended way to install `dflash-mlx-universal`.
### Install `uv` (One-time)
```bash
# Option A: Homebrew (macOS)
brew install uv
# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh
# Verify
uv --version # Should show 0.6.x or higher
```
### Install DFlash-MLX-Universal with `uv`
```bash
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
# 2. Create virtual environment with uv (uses .python-version file)
uv venv
# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"
# Or install directly from the repo
uv pip install "git+https://huggingface.co/tritesh/dflash-mlx-universal.git[dev,server]"
```
### Alternative: `uv` project workflow (no manual venv)
```bash
# 1. Enter project directory
cd dflash-mlx-universal
# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"
# 3. Lock dependencies (creates uv.lock)
uv lock
# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py
# 5. Run tests
uv run pytest tests/ -v
# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
```
### With `uv` and dependency groups
```bash
# Install only core dependencies
uv pip install -e .
# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"
# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"
# Install everything at once
uv pip install -e ".[dev,server]"
```
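To confirm the install and that MLX can see the Metal GPU, a quick sanity check helps before downloading any models. This is a minimal sketch; it only assumes `mlx` and `dflash_mlx` import cleanly after the steps above:

```python
# sanity_check.py -- verify the environment after installation (illustrative sketch)
import platform

import mlx.core as mx
import dflash_mlx

print("machine:", platform.machine())        # expect "arm64" on Apple Silicon
print("MLX device:", mx.default_device())    # expect the Metal GPU device
print("dflash_mlx version:", dflash_mlx.__version__)
```

Run it with `uv run python sanity_check.py`.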
---
## 1-alt. Installation (Classic `pip`)
If you prefer `pip`:
```bash
# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate # On zsh/bash
# 2. Upgrade pip
pip install --upgrade pip
# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"
# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
# Optional: server mode
pip install fastapi uvicorn
```
---
## 2. Quick Start: Using a Pre-converted Drafter
### Step A: Convert an Official DFlash Drafter to MLX
Official drafters are PyTorch models. You need to convert them to MLX format once:
```bash
# With uv (recommended)
uv run python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ~/models/dflash/Qwen3-4B-DFlash-mlx
# With classic pip
python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ~/models/dflash/Qwen3-4B-DFlash-mlx
```
**Supported drafters:**
```bash
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx
# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx
# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx
# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx
# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
```
**What this does:**
- Downloads PyTorch weights from HF Hub
- Transposes linear layers (PyTorch → MLX format)
- Saves as `weights.npz` + `config.json`
- Creates `model_info.json` with the target-model mapping (inspected in the sketch below)
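To check what landed in the output directory, you can read those files directly. This is only an inspection sketch based on the file names listed above; the exact config keys may differ in your checkout:

```python
# inspect_drafter.py -- peek at a converted drafter directory (illustrative sketch)
import json
from pathlib import Path

import numpy as np

drafter_dir = Path("~/models/dflash/Qwen3-4B-DFlash-mlx").expanduser()

# config.json holds drafter hyperparameters such as block_size (key assumed from later examples)
config = json.loads((drafter_dir / "config.json").read_text())
print("block_size:", config.get("block_size"))

# model_info.json records which target model this drafter was built for
info = json.loads((drafter_dir / "model_info.json").read_text())
print("target model mapping:", info)

# weights.npz contains the MLX-transposed linear layers
weights = np.load(drafter_dir / "weights.npz")
print("number of weight arrays:", len(weights.files))
```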
---
### Step B: Generate with DFlash Speculative Decoding
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
target_model=model,
draft_model=draft_model,
tokenizer=tokenizer,
block_size=draft_config.get("block_size", 16),
)
# 4. Generate with 6× speedup
output = decoder.generate(
prompt="Write a Python function to implement quicksort.",
max_tokens=1024,
temperature=0.0, # Greedy for exact reproduction
)
print(output)
```
Run with `uv`:
```bash
uv run python my_generate_script.py
```
**Expected output:**
```
[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
```
---
## 3. Streaming Generation
For real-time output (chat UI, etc.):
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
# Generator-based streaming
for chunk in decoder.generate(
prompt="Tell me a story about a robot.",
max_tokens=512,
    stream=True,  # ← returns a generator
):
print(chunk, end="", flush=True)
```
---
## 4. Benchmark Mode
Compare DFlash vs baseline speed:
```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
# Run benchmark
results = decoder.benchmark(
prompt="Write a quicksort in Python.",
max_tokens=512,
num_runs=5,
)
print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
```
Run:
```bash
uv run python benchmark_script.py
```
**Sample results (M2 Max, 96 GB):**
```
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
```
---
## 5. Universal Decoder (Any Model Without a Pre-built Drafter)
If your model doesn't have a DFlash drafter yet:
```python
from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder
# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up proper adapter for hidden state extraction
decoder = UniversalDFlashDecoder(
target_model=model,
tokenizer=tokenizer,
draft_layers=5,
draft_hidden_size=1024,
block_size=16,
)
# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
dataset="open-web-math", # or local JSONL
epochs=6,
lr=6e-4,
batch_size=16,
output_path="~/models/dflash/my-llama-drafter",
)
# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
prompt="Hello world!",
max_tokens=100,
)
```
---
## 6. OpenAI-Compatible Server
Run a local server compatible with OpenAI clients:
```bash
# With uv (recommended)
uv run python -m dflash_mlx.serve \
--target mlx-community/Qwen3-4B-bf16 \
--draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
--block-size 16 \
--port 8000
# Or in background
nohup uv run python -m dflash_mlx.serve \
--target mlx-community/Qwen3-4B-bf16 \
--draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
--port 8000 > dflash.log 2>&1 &
```
### Query the server
```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-4b",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 512,
"temperature": 0.0
}'
# Streaming
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-4b",
"messages": [{"role": "user", "content": "Count to 10"}],
"max_tokens": 100,
"stream": true
}'
# Check metrics
curl http://localhost:8000/metrics
```
### Python client
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # Local server, no auth
)
response = client.chat.completions.create(
model="qwen3-4b",
messages=[{"role": "user", "content": "Write a haiku about ML"}],
max_tokens=100,
)
print(response.choices[0].message.content)
```
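For token-by-token output in Python, the same client can request streaming, mirroring the curl `"stream": true` example above. This is a sketch that assumes the server streams through the standard OpenAI SDK interface:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Count to 10"}],
    max_tokens=100,
    stream=True,  # same flag as the curl streaming example
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # may be None on the final chunk
    if delta:
        print(delta, end="", flush=True)
print()
```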
---
## 7. Using with Ollama, aider, Continue, etc.
Any OpenAI-compatible client works:
### aider (AI coding assistant)
```bash
aider --model openai/qwen3-4b \
--openai-api-base http://localhost:8000/v1 \
--openai-api-key not-needed
```
### Continue.dev (VS Code extension)
```json
// .continue/config.json
{
"models": [{
"title": "DFlash Qwen3-4B",
"provider": "openai",
"model": "qwen3-4b",
"apiBase": "http://localhost:8000/v1",
"apiKey": "not-needed"
}]
}
```
### Ollama (as custom endpoint)
Configure any OpenAI-compatible frontend to point at `http://localhost:8000/v1`
---
## 8. Supported Model Families
| Family | Target Model Example | Drafter Status |
|--------|---------------------|----------------|
| **Qwen3** | `mlx-community/Qwen3-4B-bf16` | ✅ Pre-built |
| **Qwen3.5** | `mlx-community/Qwen3.5-9B-4bit` | ✅ Pre-built |
| **Qwen3.6** | `mlx-community/Qwen3.6-27B-4bit` | ✅ Pre-built |
| **LLaMA 3.1** | `mlx-community/Llama-3.1-8B-Instruct-4bit` | ✅ Pre-built |
| **LLaMA 3.3** | `mlx-community/Llama-3.3-70B-Instruct-4bit` | ✅ Pre-built |
| **Mistral** | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | ⚠️ Train custom |
| **Gemma** | `mlx-community/gemma-4-31b-it-4bit` | ✅ Pre-built |
| **Phi** | `mlx-community/Phi-3-mini-4k-instruct-4bit` | ⚠️ Generic adapter |
---
## 9. Troubleshooting
### "Unsupported model_type: phi"
```python
# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS
class PhiAdapter(MLXTargetAdapter):
family = "phi"
# Override methods as needed...
ADAPTERS["phi"] = PhiAdapter
```
### "Vocab size mismatch"
Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
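A quick way to compare the two vocabularies before decoding (a minimal sketch; `vocab_size` as a drafter config key is an assumption, so check your `config.json`):

```python
from mlx_lm import load
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

target_vocab = getattr(tokenizer, "vocab_size", None)  # forwarded from the underlying HF tokenizer
draft_vocab = draft_config.get("vocab_size")           # assumed config key
print("target vocab:", target_vocab, "| draft vocab:", draft_vocab)
if draft_vocab is not None and draft_vocab != target_vocab:
    print("Mismatch: use a drafter trained for this target family.")
```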
### Slow first run
MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
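If you time runs yourself instead of using `benchmark()`, do a short warmup generation first so kernel compilation is excluded. A minimal sketch, reusing the setup from section 4:

```python
import time

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Warmup: the first generation triggers lazy Metal kernel compilation
decoder.generate(prompt="warmup", max_tokens=8, temperature=0.0)

# Time only after the warmup pass
start = time.perf_counter()
decoder.generate(prompt="Write a quicksort in Python.", max_tokens=256, temperature=0.0)
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```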
### Out of memory
- Reduce `--block-size` (default 16 → 8; see the sketch after this list)
- Use 4-bit quantized target models (`-4bit` suffix)
- Reduce `max_tokens`
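A lower-memory configuration combining those knobs might look like the sketch below. The 4-bit repo name is illustrative only; substitute a `-4bit` conversion that actually exists for your target:

```python
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# Hypothetical 4-bit target repo -- pick a real `-4bit` conversion for your model
model, tokenizer = load("mlx-community/Qwen3-4B-4bit")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=8)  # 16 -> 8
output = decoder.generate(prompt="Summarize MLX in two sentences.", max_tokens=128)  # smaller budget
print(output)
```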
### Draft tokens all rejected
- Drafter may not match target model (wrong family)
- Use trained drafter for your specific model
- Check `target_layer_ids` alignment in the drafter config (see the sketch after this list)
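To inspect that alignment, read the drafter's `config.json` directly (a sketch; the exact shape of `target_layer_ids` may differ):

```python
import json
from pathlib import Path

cfg_path = Path("~/models/dflash/Qwen3-4B-DFlash-mlx/config.json").expanduser()
cfg = json.loads(cfg_path.read_text())

print("target_layer_ids:", cfg.get("target_layer_ids"))  # which target layers feed the drafter
print("block_size:", cfg.get("block_size"))
```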
---
## Full Example Script
Save as `run_dflash.py`:
```python
#!/usr/bin/env python3
"""Complete DFlash example with error handling."""
import sys
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
def main():
# Configuration
TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
PROMPT = "Explain how speculative decoding works."
MAX_TOKENS = 512
print(f"Loading target model: {TARGET_MODEL}")
model, tokenizer = load(TARGET_MODEL)
print(f"Loading DFlash drafter: {DRAFT_MODEL}")
try:
draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
except FileNotFoundError:
print(f"Error: Drafter not found at {DRAFT_MODEL}")
print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
sys.exit(1)
print("Creating DFlash decoder...")
decoder = DFlashSpeculativeDecoder(
target_model=model,
draft_model=draft_model,
tokenizer=tokenizer,
block_size=draft_config.get("block_size", 16),
)
print(f"\nPrompt: {PROMPT}")
print("-" * 60)
output = decoder.generate(
prompt=PROMPT,
max_tokens=MAX_TOKENS,
temperature=0.0,
)
print(output)
print("-" * 60)
print("Done!")
if __name__ == "__main__":
main()
```
Run:
```bash
uv run python run_dflash.py
```
---
## Daily Workflow with `uv`
```bash
# cd into your project
cd ~/projects/dflash-mlx-universal
# Run any script β uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py
# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000
# Run tests
uv run pytest tests/ -v
# Format code
uv run black dflash_mlx/
# Lint
uv run ruff check dflash_mlx/
# Add a dependency
uv add "numpy>=1.26.0"
# Lock dependencies
uv lock
# Sync environment with lock file
uv sync
```
---
## Next Steps
1. **Install `uv`** → `brew install uv`
2. **Clone repo** → `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
3. **Install** → `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
4. **Convert drafter** → `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
5. **Benchmark** → `uv run python examples/qwen3_4b_demo.py`
6. **Start server** → `uv run python -m dflash_mlx.serve --target ... --draft ...`
7. **Connect tools** → aider, Continue, custom clients
8. **Train custom drafters** → for unsupported models, using `UniversalDFlashDecoder`
---
For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions