
DFlash-MLX-Universal: System Usage Guide

How to use dflash-mlx-universal on your Apple Silicon Mac (M1/M2/M3/M4)


πŸ“‹ Prerequisites

| Requirement | Version | Notes |
|---|---|---|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |

1️⃣ Installation (Recommended: uv)

uv is an extremely fast Python package manager written in Rust. It's the recommended way to install dflash-mlx-universal.

Install uv (One-time)

# Option A: Homebrew (macOS)
brew install uv

# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify
uv --version  # Should show 0.6.x or higher

Install DFlash-MLX-Universal with uv

# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal

# 2. Create virtual environment with uv (uses .python-version file)
uv venv

# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"

# Or install directly from the repo
# Or install directly from the repo (PEP 508 direct reference; extras go on the package name)
uv pip install "dflash-mlx-universal[dev,server] @ git+https://huggingface.co/tritesh/dflash-mlx-universal.git"

Alternative: uv project workflow (no manual venv)

# 1. Enter project directory
cd dflash-mlx-universal

# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"

# 3. Lock dependencies (creates uv.lock)
uv lock

# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py

# 5. Run tests
uv run pytest tests/ -v

# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000

With uv and dependency groups

# Install only core dependencies
uv pip install -e .

# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"

# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"

# Install everything at once
uv pip install -e ".[dev,server]"

1️⃣-alt Installation (Classic pip)

If you prefer pip:

# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate  # On zsh/bash

# 2. Upgrade pip
pip install --upgrade pip

# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"

# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git

# Optional: server mode
pip install fastapi uvicorn

2️⃣ Quick Start β€” Using a Pre-converted Drafter

Step A: Convert an Official DFlash Drafter to MLX

Official drafters are PyTorch models. You need to convert them to MLX format once:

# With uv (recommended)
uv run python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

# With classic pip
python -m dflash_mlx.convert \
    --model z-lab/Qwen3-4B-DFlash-b16 \
    --output ~/models/dflash/Qwen3-4B-DFlash-mlx

Supported drafters:

# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16  --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16  --output ~/models/dflash/Qwen3-8B-DFlash-mlx

# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash    --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash   --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx

# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash   --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx

# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx

# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx

# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx

What this does:

  • Downloads PyTorch weights from HF Hub
  • Transposes linear layers (PyTorch β†’ MLX format)
  • Saves as weights.npz + config.json
  • Creates model_info.json with target model mapping
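For intuition, the conversion step can be sketched with plain NumPy. This is a toy stand-in for `dflash_mlx.convert`: `convert_state_dict` and its layout rule are illustrative, not the real implementation.

```python
import json
import os

import numpy as np

def convert_state_dict(state_dict, output_dir, config):
    """Toy sketch of the conversion step: re-layout 2-D linear weights
    and save weights.npz + config.json, as the bullet list describes.
    (Hypothetical helper; the real converter lives in dflash_mlx.convert.)"""
    os.makedirs(output_dir, exist_ok=True)
    converted = {}
    for name, w in state_dict.items():
        arr = np.asarray(w)
        # 2-D linear weights get their layout transposed for MLX here
        if name.endswith("weight") and arr.ndim == 2:
            arr = arr.T
        converted[name] = arr
    np.savez(os.path.join(output_dir, "weights.npz"), **converted)
    with open(os.path.join(output_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
```

The real converter additionally writes `model_info.json` with the target model mapping; this sketch only covers the weight re-layout and the two files named above.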

Step B: Generate with DFlash Speculative Decoding

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")

# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)

# 4. Generate with up to ~6x speedup
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)

Run with uv:

uv run python my_generate_script.py

Expected output:

[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
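The "avg acceptance" figure is the mean number of draft tokens the target accepts per block. The draft-and-verify loop behind it can be sketched in miniature with stand-in callables (toy functions, not the real dflash_mlx API):

```python
def speculative_decode(prompt_ids, draft, target, block_size=16, max_tokens=32):
    """Toy block-wise speculative decoding: the drafter proposes a block,
    the target verifies it, and the longest agreeing prefix is accepted.
    `draft(ids, k)` and `target(ids, block)` are stand-in callables."""
    ids = list(prompt_ids)
    accepted_lens = []
    while len(ids) - len(prompt_ids) < max_tokens:
        block = draft(ids, block_size)      # cheap proposal of block_size tokens
        verified = target(ids, block)       # target's choice at each draft position
        n = 0
        while n < len(block) and block[n] == verified[n]:
            n += 1                          # accept the matching prefix
        if n < len(block):
            # accepted tokens plus one corrected token from the target
            ids += block[:n] + [verified[n]]
        else:
            ids += block[:n]                # whole block accepted
        accepted_lens.append(n)
    return ids, sum(accepted_lens) / len(accepted_lens)
```

Each loop iteration costs roughly one target forward pass but emits up to block_size + 1 tokens, which is where the speedup comes from.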

3️⃣ Streaming Generation

For real-time output (chat UI, etc.):

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # ← Returns generator
):
    print(chunk, end="", flush=True)

4️⃣ Benchmark Mode

Compare DFlash vs baseline speed:

from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)

# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)

print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")

Run:

uv run python benchmark_script.py

Sample results (M2 Max, 96GB):

[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
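The reported speedup is simply baseline wall time divided by DFlash wall time:

```python
# Numbers from the sample run above
baseline_s, dflash_s = 2.34, 0.41
speedup = baseline_s / dflash_s
print(f"Speedup: {speedup:.2f}x")  # Speedup: 5.71x
```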

5️⃣ Universal Decoder (Any Model Without Pre-built Drafter)

If your model doesn't have a DFlash drafter yet:

from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder

# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up proper adapter for hidden state extraction

decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)

# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or local JSONL
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)

# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)

6️⃣ OpenAI-Compatible Server

Run a local server compatible with OpenAI clients:

# With uv (recommended)
uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --block-size 16 \
    --port 8000

# Or in background
nohup uv run python -m dflash_mlx.serve \
    --target mlx-community/Qwen3-4B-bf16 \
    --draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
    --port 8000 > dflash.log 2>&1 &

Query the server

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.0
  }'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "max_tokens": 100,
    "stream": true
  }'

# Check metrics
curl http://localhost:8000/metrics

Python client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

7️⃣ Using with Ollama, aider, Continue, etc.

Any OpenAI-compatible client works:

aider (AI coding assistant)

aider --model openai/qwen3-4b \
    --openai-api-base http://localhost:8000/v1 \
    --openai-api-key not-needed

Continue.dev (VS Code extension)

// .continue/config.json
{
  "models": [{
    "title": "DFlash Qwen3-4B",
    "provider": "openai",
    "model": "qwen3-4b",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}

Other OpenAI-compatible frontends

Point any frontend that speaks the OpenAI API at http://localhost:8000/v1


8️⃣ Supported Model Families

| Family | Target Model Example | Drafter Status |
|---|---|---|
| Qwen3 | mlx-community/Qwen3-4B-bf16 | ✅ Pre-built |
| Qwen3.5 | mlx-community/Qwen3.5-9B-4bit | ✅ Pre-built |
| Qwen3.6 | mlx-community/Qwen3.6-27B-4bit | ✅ Pre-built |
| LLaMA 3.1 | mlx-community/Llama-3.1-8B-Instruct-4bit | ✅ Pre-built |
| LLaMA 3.3 | mlx-community/Llama-3.3-70B-Instruct-4bit | ✅ Pre-built |
| Mistral | mlx-community/Mistral-7B-Instruct-v0.3-4bit | ⚠️ Train custom |
| Gemma | mlx-community/gemma-4-31b-it-4bit | ✅ Pre-built |
| Phi | mlx-community/Phi-3-mini-4k-instruct-4bit | ⚠️ Generic adapter |

9️⃣ Troubleshooting

"Unsupported model_type: phi"

# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS

class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...

ADAPTERS["phi"] = PhiAdapter

"Vocab size mismatch"

Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
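A quick way to catch this before loading weights is to compare the vocab_size fields of the two config.json files. The field name assumes the usual HF-style config layout; check_vocab_match is an illustrative helper, not part of the dflash_mlx API.

```python
import json

def check_vocab_match(target_config_path, draft_config_path):
    """Compare the vocab_size fields of two HF-style config.json files
    and raise with a clear message on mismatch. (Illustrative helper.)"""
    with open(target_config_path) as f:
        tgt = json.load(f)
    with open(draft_config_path) as f:
        drf = json.load(f)
    t, d = tgt.get("vocab_size"), drf.get("vocab_size")
    if t != d:
        raise ValueError(f"Vocab size mismatch: target={t}, draft={d}")
    return t
```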

Slow first run

MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
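If you write your own timing harness, use the same warmup-then-measure pattern. This is generic Python; generate stands in for any zero-argument call into the decoder.

```python
import time

def timed_generate(generate, *, warmup=1, runs=3):
    """Run `generate` (any zero-arg callable) `warmup` times untimed,
    then return the mean wall time over `runs` timed calls."""
    for _ in range(warmup):
        generate()                      # absorbs lazy kernel compilation
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```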

Out of memory

  • Reduce --block-size (default 16 β†’ 8)
  • Use 4-bit quantized target models (-4bit suffix)
  • Reduce max_tokens

Draft tokens all rejected

  • Drafter may not match target model (wrong family)
  • Use trained drafter for your specific model
  • Check target_layer_ids alignment in config

πŸ”Ÿ Full Example Script

Save as run_dflash.py:

#!/usr/bin/env python3
"""Complete DFlash example with error handling."""

import os
import sys
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    # Expand "~" explicitly so the path resolves regardless of how the loader handles it
    DRAFT_MODEL = os.path.expanduser("~/models/dflash/Qwen3-4B-DFlash-mlx")
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512
    
    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)
    
    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)
    
    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )
    
    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)
    
    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )
    
    print(output)
    print("-" * 60)
    print("Done!")

if __name__ == "__main__":
    main()

Run:

uv run python run_dflash.py

πŸ”„ Daily Workflow with uv

# cd into your project
cd ~/projects/dflash-mlx-universal

# Run any script β€” uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py

# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000

# Run tests
uv run pytest tests/ -v

# Format code
uv run black dflash_mlx/

# Lint
uv run ruff check dflash_mlx/

# Add a dependency
uv add "numpy>=1.26.0"

# Lock dependencies
uv lock

# Sync environment with lock file
uv sync

πŸ“š Next Steps

  1. Install uv β†’ brew install uv
  2. Clone repo β†’ git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
  3. Install β†’ cd dflash-mlx-universal && uv pip install -e ".[dev,server]"
  4. Convert drafter β†’ uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter
  5. Benchmark β†’ uv run python examples/qwen3_4b_demo.py
  6. Start server β†’ uv run python -m dflash_mlx.serve --target ... --draft ...
  7. Connect tools β†’ aider, Continue, custom clients
  8. Train custom drafters β†’ For unsupported models using UniversalDFlashDecoder

For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions