
DFlash-MLX-Universal: System Usage Guide

How to use dflash-mlx-universal on your Apple Silicon Mac (M1/M2/M3/M4)


πŸ“‹ Prerequisites

| Requirement | Version | Notes |
|---|---|---|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |

1️⃣ Installation (Recommended: uv)

uv is an extremely fast Python package manager written in Rust. It's the recommended way to install dflash-mlx-universal.

Install uv (One-time)

# Option A: Homebrew (macOS)
brew install uv

# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify
uv --version  # Should show 0.6.x or higher

Install DFlash-MLX-Universal with uv

# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal

# 2. Create virtual environment with uv (uses .python-version file)
uv venv

# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"

# Or install directly from the repo
# Or install directly from the repo (PEP 508 direct reference; extras go on the package name)
uv pip install "dflash-mlx-universal[dev,server] @ git+https://huggingface.co/tritesh/dflash-mlx-universal.git"

Alternative: uv project workflow (no manual venv)

# 1. Enter project directory
cd dflash-mlx-universal

# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"

# 3. Lock dependencies (creates uv.lock)
uv lock

# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py

# 5. Run tests
uv run pytest tests/ -v

# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000

With uv and dependency groups

# Install only core dependencies
uv pip install -e .

# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"

# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"

# Install everything at once
uv pip install -e ".[dev,server]"

1️⃣-alt Installation (Classic pip)

If you prefer pip:

# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate  # On zsh/bash

# 2. Upgrade pip
pip install --upgrade pip

# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"

# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git

# Optional: server mode
pip install fastapi uvicorn

2️⃣ Quick Start β€” Using a Pre-converted Drafter

Step A: Convert an Official DFlash Drafter to MLX

Official drafters are PyTorch models. You need to convert them to MLX format once:

# With uv (recommended)
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

Supported drafters:

# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16  --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16  --output ~/models/dflash/Qwen3-8B-DFlash-mlx

# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash    --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash   --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx

# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash   --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx

# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx

# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx

# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx

What this does:

  • Downloads PyTorch weights from HF Hub
  • Transposes linear layers (PyTorch β†’ MLX format)
  • Saves as weights.npz + config.json
  • Creates model_info.json with target model mapping
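For intuition, the conversion step can be sketched with plain NumPy. This is a toy stand-in for `dflash_mlx.convert`: `convert_state_dict` and its layout rule are illustrative, not the real implementation.

```python
import json
import os

import numpy as np

def convert_state_dict(state_dict, output_dir, config):
    """Toy sketch of the conversion step: re-layout 2-D linear weights
    and save weights.npz + config.json, as the bullet list describes.
    (Hypothetical helper; the real converter lives in dflash_mlx.convert.)"""
    os.makedirs(output_dir, exist_ok=True)
    converted = {}
    for name, w in state_dict.items():
        arr = np.asarray(w)
        # 2-D linear weights get their layout transposed for MLX here
        if name.endswith("weight") and arr.ndim == 2:
            arr = arr.T
        converted[name] = arr
    np.savez(os.path.join(output_dir, "weights.npz"), **converted)
    with open(os.path.join(output_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
```

The real converter additionally writes `model_info.json` with the target model mapping; this sketch only covers the weight re-layout and the two files named above.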

Step B: Generate with DFlash Speculative Decoding

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")

# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)

# 4. Generate with up to ~6x speedup
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)

Run with uv:

uv run python my_generate_script.py

Expected output:

[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
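The "avg acceptance" figure is the mean number of draft tokens the target accepts per block. The draft-and-verify loop behind it can be sketched in miniature with stand-in callables (toy functions, not the real dflash_mlx API):

```python
def speculative_decode(prompt_ids, draft, target, block_size=16, max_tokens=32):
    """Toy block-wise speculative decoding: the drafter proposes a block,
    the target verifies it, and the longest agreeing prefix is accepted.
    `draft(ids, k)` and `target(ids, block)` are stand-in callables."""
    ids = list(prompt_ids)
    accepted_lens = []
    while len(ids) - len(prompt_ids) < max_tokens:
        block = draft(ids, block_size)      # cheap proposal of block_size tokens
        verified = target(ids, block)       # target's choice at each draft position
        n = 0
        while n < len(block) and block[n] == verified[n]:
            n += 1                          # accept the matching prefix
        if n < len(block):
            # accepted tokens plus one corrected token from the target
            ids += block[:n] + [verified[n]]
        else:
            ids += block[:n]                # whole block accepted
        accepted_lens.append(n)
    return ids, sum(accepted_lens) / len(accepted_lens)
```

Each loop iteration costs roughly one target forward pass but emits up to block_size + 1 tokens, which is where the speedup comes from.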

3️⃣ Streaming Generation

For real-time output (chat UI, etc.):

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # ← Returns generator
):
    print(chunk, end="", flush=True)

4️⃣ Benchmark Mode

Compare DFlash vs baseline speed:

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)

print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")

Run:

uv run python benchmark_script.py

Sample results (M2 Max, 96GB):

[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
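The reported speedup is simply baseline wall time divided by DFlash wall time:

```python
# Numbers from the sample run above
baseline_s, dflash_s = 2.34, 0.41
speedup = baseline_s / dflash_s
print(f"Speedup: {speedup:.2f}x")  # Speedup: 5.71x
```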

5️⃣ Universal Decoder (Any Model Without Pre-built Drafter)

If your model doesn't have a DFlash drafter yet:

from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder

# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up proper adapter for hidden state extraction

decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or local JSONL
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)

# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)

6️⃣ OpenAI-Compatible Server

Run a local server compatible with OpenAI clients:

# With uv (recommended)
uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --block-size 16 \
    --port 8000

# Or in background
nohup uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --port 8000 > dflash.log 2>&1 &

Query the server

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.0
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "max_tokens": 100,
    "stream": true
  }'

# Check metrics
curl http://localhost:8000/metrics

Python client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

7️⃣ Using with Ollama, aider, Continue, etc.

Any OpenAI-compatible client works:

aider (AI coding assistant)

aider --model openai/qwen3-4b \
    --openai-api-base http://localhost:8000/v1 \
    --openai-api-key not-needed

Continue.dev (VS Code extension)

// .continue/config.json
{
  "models": [{
    "title": "DFlash Qwen3-4B",
    "provider": "openai",
    "model": "qwen3-4b",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}

Other OpenAI-compatible frontends

Point any frontend that speaks the OpenAI API at http://localhost:8000/v1


8️⃣ Supported Model Families

| Family | Target Model Example | Drafter Status |
|---|---|---|
| Qwen3 | mlx-community/Qwen3-4B-bf16 | ✅ Pre-built |
| Qwen3.5 | mlx-community/Qwen3.5-9B-4bit | ✅ Pre-built |
| Qwen3.6 | mlx-community/Qwen3.6-27B-4bit | ✅ Pre-built |
| LLaMA 3.1 | mlx-community/Llama-3.1-8B-Instruct-4bit | ✅ Pre-built |
| LLaMA 3.3 | mlx-community/Llama-3.3-70B-Instruct-4bit | ✅ Pre-built |
| Mistral | mlx-community/Mistral-7B-Instruct-v0.3-4bit | ⚠️ Train custom |
| Gemma | mlx-community/gemma-4-31b-it-4bit | ✅ Pre-built |
| Phi | mlx-community/Phi-3-mini-4k-instruct-4bit | ⚠️ Generic adapter |

9️⃣ Troubleshooting

"Unsupported model_type: phi"

# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS

class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...

ADAPTERS["phi"] = PhiAdapter

"Vocab size mismatch"

Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
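A quick way to catch this before loading weights is to compare the vocab_size fields of the two config.json files. The field name assumes the usual HF-style config layout; check_vocab_match is an illustrative helper, not part of the dflash_mlx API.

```python
import json

def check_vocab_match(target_config_path, draft_config_path):
    """Compare the vocab_size fields of two HF-style config.json files
    and raise with a clear message on mismatch. (Illustrative helper.)"""
    with open(target_config_path) as f:
        tgt = json.load(f)
    with open(draft_config_path) as f:
        drf = json.load(f)
    t, d = tgt.get("vocab_size"), drf.get("vocab_size")
    if t != d:
        raise ValueError(f"Vocab size mismatch: target={t}, draft={d}")
    return t
```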

Slow first run

MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
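If you write your own timing harness, use the same warmup-then-measure pattern. This is generic Python; generate stands in for any zero-argument call into the decoder.

```python
import time

def timed_generate(generate, *, warmup=1, runs=3):
    """Run `generate` (any zero-arg callable) `warmup` times untimed,
    then return the mean wall time over `runs` timed calls."""
    for _ in range(warmup):
        generate()                      # absorbs lazy kernel compilation
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```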

Out of memory

  • Reduce --block-size (default 16 β†’ 8)
  • Use 4-bit quantized target models (-4bit suffix)
  • Reduce max_tokens

Draft tokens all rejected

  • Drafter may not match target model (wrong family)
  • Use trained drafter for your specific model
  • Check target_layer_ids alignment in config

πŸ”Ÿ Full Example Script

Save as run_dflash.py:

#!/usr/bin/env python3
"""Complete DFlash example with error handling."""

import os
import sys
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    # Expand "~" explicitly so the path resolves regardless of how the loader handles it
    DRAFT_MODEL = os.path.expanduser("~/models/dflash/Qwen3-4B-DFlash-mlx")
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512
    
    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)
    
    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)
    
    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )
    
    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)
    
    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )
    
    print(output)
    print("-" * 60)
    print("Done!")

if __name__ == "__main__":
    main()

Run:

uv run python run_dflash.py

πŸ”„ Daily Workflow with uv

# cd into your project
cd ~/projects/dflash-mlx-universal

# Run any script β€” uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py

# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000

# Run tests
uv run pytest tests/ -v

# Format code
uv run black dflash_mlx/

# Lint
uv run ruff check dflash_mlx/

# Add a dependency
uv add "numpy>=1.26.0"

# Lock dependencies
uv lock

# Sync environment with lock file
uv sync

πŸ“š Next Steps

  1. Install uv β†’ brew install uv
  2. Clone repo β†’ git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
  3. Install β†’ cd dflash-mlx-universal && uv pip install -e ".[dev,server]"
  4. Convert drafter β†’ uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter
  5. Benchmark β†’ uv run python examples/qwen3_4b_demo.py
  6. Start server β†’ uv run python -m dflash_mlx.serve --target ... --draft ...
  7. Connect tools β†’ aider, Continue, custom clients
  8. Train custom drafters β†’ For unsupported models using UniversalDFlashDecoder

For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions