license: mit
base_model: HiDream-ai/HiDream-O1-Image-Dev
tags:
  - mlx
  - mlx-vlm
  - hidream
  - text-to-image
  - apple-silicon
  - bf16
language:
  - en
pipeline_tag: text-to-image
library_name: mlx
inference: false
authors:
  - Mrbizarro

HiDream-O1-Image-Dev — MLX port for Apple Silicon

Ported by Mrbizarro · MIT licensed · published to mlx-community

🎛️ Run it one-click in Phosphene

Phosphene is a free local generative-video panel for Apple Silicon Macs (M1 or newer). It ships with HiDream-O1 wired into its Image Studio: pick "HiDream-O1-Image-Dev BF16" from the engine dropdown and you get native edit and multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. Install Pinokio, then install Phosphene from within Pinokio.

A native MLX port of HiDream-ai/HiDream-O1-Image-Dev for fast local image generation on Apple Silicon Macs. No PyTorch, no CUDA, no flash-attn required at inference time.

Capabilities (all native to HiDream-O1, all working in this port):

Text-to-image at 1024×1024 / 2048×2048 / non-square trained dims
Instruction-based image edit with 1 reference image (e.g. "change the chef's white jacket to red" — preserves scene, pose, identity)
Multi-reference subject personalization with 2-3 reference images (compose multiple subjects in a new scene)

HiDream-O1 is an 8B Qwen3-VL-based unified pixel-patch transformer — it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
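To make the pixel-patch idea concrete, the following sketch counts how many 32×32 RGB patches an image of a given size decomposes into, and how many raw values each patch carries. It is arithmetic only; `patch_grid` is a hypothetical helper and is not taken from the port's patchify code.

```python
PATCH = 32      # patch edge length in pixels, from the model description
CHANNELS = 3    # RGB

def patch_grid(width: int, height: int, patch: int = PATCH) -> dict:
    """Return the patch layout for an image whose sides are multiples of `patch`."""
    if width % patch or height % patch:
        raise ValueError(f"{width}x{height} is not a multiple of {patch}")
    cols, rows = width // patch, height // patch
    return {
        "cols": cols,
        "rows": rows,
        "patches": cols * rows,
        "values_per_patch": patch * patch * CHANNELS,
    }

for w, h in [(1024, 1024), (2048, 2048), (1440, 2560)]:
    print((w, h), patch_grid(w, h))
```

Going from 1024×1024 to 2048×2048 quadruples the patch count, which is one reason the larger sizes take noticeably longer per step.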

This port:

Reuses mlx-vlm's Qwen3-VL backbone (vision tower, decoder layers, mrope-3D)
Adds the three diffusion-side custom heads (t_embedder1, x_embedder, final_layer2)
Ports the FlashFlowMatchEulerDiscreteScheduler and the unified-token-sequence builder
Ships BF16 weights (no quantization — see "Why BF16" below)
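For readers unfamiliar with flow-matching samplers, here is a minimal sketch of a generic Euler update over a decreasing noise schedule. It is not the ported FlashFlowMatchEulerDiscreteScheduler: the linear sigma schedule, the function names, and the toy velocity function are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]

def linear_sigmas(num_steps: int) -> List[float]:
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean). Assumed linear."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]

def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int = 28) -> Vector:
    """Integrate dx/dsigma = velocity(x, sigma) from sigma=1 to sigma=0."""
    x = list(noise)
    sigmas = linear_sigmas(num_steps)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity(x, sigma)
        # Euler update: the step is negative because sigma decreases.
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x

# Toy "model": with x_sigma = (1 - sigma) * data + sigma * noise,
# the exact velocity is noise - data, independent of sigma.
data = [0.25, -0.5, 1.0]
noise = [1.5, 0.3, -0.7]
result = euler_sample(lambda x, s: [n - d for n, d in zip(noise, data)], noise)
print([round(r, 6) for r in result])
```

With the exact toy velocity, the loop walks from the noise sample back to `data`. In the real pipeline the velocity comes from the transformer's prediction at each step.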

Hero samples

All generated by the included generator script on a 64 GB Mac Studio. Click any image to open full-resolution.


Construction worker on a rainy rooftop, Kodak Tri-X B&W. 2048×2048, BF16, 213 s.
Elderly Japanese tea master holding a ceramic cup. 1024×1024, Q6 (showcase), 36 s.
Tropical beach with turquoise water and palms. 1024×1024, Q8, 67 s.
Candid morning portrait, woman with coffee + toast, soft window light. 1440×2560, BF16, 127 s.
Astronaut in space-station corridor, anamorphic lens flare. 2560×1440, BF16, 187 s.
Snow-capped mountain peak at sunset. 2048×2048, Q4 (early), 236 s.
Alice in cyberpunk, neon Cheshire cat hologram. 2048×2048, Q8, 276 s.
Fitness influencer mid-deadlift in industrial gym. 1440×2560, BF16, 127 s.

More: sample_outputs/hero/.

Variants

Variant	Repo	Backbone size	Peak RAM (1024×1024)	Quality
BF16 (this repo)	`mlx-community/HiDream-O1-Image-Dev-mlx-bf16`	17.5 GB	16 GB	✅ Clean across all trained dims
Q8	`mlx-community/HiDream-O1-Image-Dev-mlx-q8`	10 GB	11.5 GB	⚠ Clean at square dims, grid at non-square
Q6	`mlx-community/HiDream-O1-Image-Dev-mlx-q6`	8 GB	8.5 GB	⚠ Clean at square dims, grid at non-square

Q4 was tested and rejected: brightness collapses and every image comes out dark.

Why BF16 is the safe default

Per-group dequantization rounding (Q6/Q8) compounds across the 36 decoder layers and shows up as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at non-square trained dimensions such as 1440×2560 or 3104×1312. BF16 matches the precision of the upstream configuration (torch_dtype=torch.float32 with autocast(bfloat16)) and is the only variant that stays clean across all trained dimensions.
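The rounding effect described above can be illustrated with a toy round trip. The sketch below quantizes random weights in groups that share one scale and offset, then measures the mean error after dequantization at 4, 6, and 8 bits. The affine scheme, the group size of 64, and the Gaussian weights are assumptions chosen for illustration; this is not the conversion script's code, and it measures a single tensor only, not the compounding across layers.

```python
import random

def quantize_group(values, bits):
    """Affine-quantize one group to `bits` bits and return the dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    return [lo + round((v - lo) / scale) * scale for v in values]

def mean_abs_error(weights, bits, group_size=64):
    """Average round-trip error when each group of `group_size` shares one scale."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        total += sum(abs(a - b) for a, b in zip(group, quantize_group(group, bits)))
    return total / len(weights)

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]
errors = {bits: mean_abs_error(weights, bits) for bits in (4, 6, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

Each extra bit roughly halves the per-weight error, so the 8-bit error is small but never zero, which is consistent with artifacts only becoming visible after many layers.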

If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, Q6 is roughly half the size of BF16 (8 GB vs 17.5 GB) and about 2× faster, with no quality loss at those dims for text-to-image. Use Q6 on a 16 GB Mac, BF16 on 32 GB+.

Install

Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio.

Quick start (download pre-converted weights — recommended)

# Download the repo (code, docs, samples, and the BF16 weights)
hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx
cd hidream-o1-mlx

# Set up the venv
uv venv --python 3.11
uv pip install -r requirements.txt

# Generate (model files are at the repo root — pass --model-path .)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path . \
  --prompt "your prompt here" \
  --output out.png

Or convert from upstream weights yourself

git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16
cd HiDream-O1-Image-Dev-mlx-bf16
uv venv --python 3.11
uv pip install -r requirements.txt

# Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk)
.venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \
  --hf-source HiDream-ai/HiDream-O1-Image-Dev \
  --out-dir mlx_models/hidream-o1-dev-bf16 \
  --bits 16

Usage

# Single image, default 1024×1024 BF16
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "your prompt here" \
  --output sample_outputs/whatever.png \
  --seed 42

# Higher resolution (2048×2048 = upstream default)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 2048 --height 2048 \
  --output sample_outputs/big.png

# Vertical / cinema (auto-snaps to nearest trained ratio)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 1440 --height 2560 \
  --output sample_outputs/portrait.png

# Instruction-based edit (one ref image)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
  --output sample_outputs/edit_red_jacket.png \
  --ref-images /path/to/chef.jpg \
  --seed 42

# Multi-reference subject personalization (2-3 refs)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
  --output sample_outputs/multi_ref.png \
  --ref-images /path/to/person.jpg /path/to/place.jpg \
  --seed 42
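If you want to drive the generator from another Python program, one option is to assemble the same command line programmatically. The sketch below only uses flags that appear in the examples above; `build_command` is a hypothetical helper, and the defaults the script applies when a flag is omitted are not documented here.

```python
import shlex
from typing import List, Optional, Sequence

SCRIPT = "scripts/hidream_o1/generate_hidream_o1_mlx.py"

def build_command(prompt: str, output: str, model_path: str = ".",
                  width: Optional[int] = None, height: Optional[int] = None,
                  seed: Optional[int] = None,
                  ref_images: Sequence[str] = ()) -> List[str]:
    """Assemble the generator command line from the flags shown above."""
    cmd = [".venv/bin/python", SCRIPT,
           "--model-path", model_path,
           "--prompt", prompt,
           "--output", output]
    if width is not None and height is not None:
        cmd += ["--width", str(width), "--height", str(height)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if ref_images:
        # One path means instruction edit; two or three mean multi-reference.
        cmd += ["--ref-images", *ref_images]
    return cmd

cmd = build_command("a red bicycle", "out.png", seed=42,
                    ref_images=["a.jpg", "b.jpg"])
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain apostrophes or commas.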

Trained resolutions

HiDream-O1 was trained on a fixed list of resolutions. The generator auto-snaps the requested size to the closest entry, because off-list dimensions produce visible patch artifacts. The trained list:

2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560,
2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304
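The snapping step can be pictured as a nearest-aspect-ratio lookup. The generator's actual rule is not spelled out in this README, so the sketch below is an assumption: it compares aspect ratios in log space and returns the closest entry from the list above. `snap_to_trained` is a hypothetical name, and the sketch ignores output scale, so it does not model how the generator handles 1024×1024.

```python
import math

TRAINED = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104), (2304, 1792),
    (1792, 2304),
]

def snap_to_trained(width: int, height: int):
    """Return the trained (width, height) whose aspect ratio is closest."""
    target = math.log(width / height)
    return min(TRAINED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

print(snap_to_trained(1920, 1080))  # a 16:9 request
print(snap_to_trained(1080, 1350))  # a 4:5 request
```

Comparing ratios in log space treats a landscape shape and its portrait mirror symmetrically, so 16:9 and 9:16 requests are equally close to their respective trained entries.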

Prompt tips for realism

HiDream is responsive to camera/film terminology. To avoid the AI-glossy look:

Lead with "masterpiece, best quality" (a phrase the community found the model responds to)
Subject + Actions → Setting → Style → Details ordering
Specify equipment: Leica M6 with Kodak Tri-X 400, Pentax K1000 + Cinestill 800T, Hasselblad H6D medium format
Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen
Spell out skin imperfection: "natural pores", "faint laugh lines", "weathered hands", "no retouching"
Avoid "stunning", "perfect", "beautiful" — they push toward AI-glamour aesthetics

The Dev model uses guidance_scale=0.0 so negative prompts have no effect — push positive prompts harder instead.
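Why a zero guidance scale makes negative prompts inert can be sketched as follows. This assumes the convention used by many diffusion pipelines, where classifier-free guidance only runs when the scale is above 1; whether the upstream HiDream-O1 code follows exactly this rule is not confirmed here, and `combine` is a hypothetical function.

```python
from typing import List, Optional

def combine(cond: List[float], uncond: Optional[List[float]],
            guidance_scale: float) -> List[float]:
    """Classifier-free guidance, active only when guidance_scale > 1 (assumed)."""
    if guidance_scale <= 1.0 or uncond is None:
        # Guidance off: only the positive-prompt prediction is used,
        # so a negative prompt never enters the computation.
        return list(cond)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

cond = [1.0, 2.0]      # prediction for the positive prompt
uncond = [0.5, 0.0]    # prediction for the negative / empty prompt
print(combine(cond, uncond, 0.0))  # guidance disabled
print(combine(cond, uncond, 3.0))  # guidance enabled
```

Under this convention the negative-prompt branch is skipped entirely at scale 0.0, which also saves one forward pass per step.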

What's in this repo

hidream-o1-mlx/
├── README.md                                 (this file)
├── LICENSE                                   (MIT)
├── requirements.txt                          (mlx-vlm 0.5.0, transformers 5.8+, deps)
├── scripts/hidream_o1/
│   ├── convert_hidream_o1_to_mlx.py          (HF → MLX, BF16 / Q4 / Q6 / Q8)
│   ├── generate_hidream_o1_mlx.py            (T2I generator + experimental edit/multi-ref)
│   ├── hidream_model.py                      (custom heads + forward_generation)
│   ├── pipeline_helpers.py                   (T2I sample, mrope, mask, patchify)
│   └── flow_match.py                         (FlashFlowMatchScheduler in MLX)
├── docs/
│   ├── EVALUATION.md                         (perf + quality findings, A/B vs mflux)
│   ├── HIDREAM_O1_MLX_PORT_REPORT.md         (architecture + weight conversion details)
│   └── PHOSPHENE_INTEGRATION_PLAN.md         (how it slots into a host app)
├── sample_outputs/                           (gallery)
└── mlx_models/                               (where converted weights land)

Performance

Resolution	Per step	Total (28 steps)	Peak RAM
1024×1024	2.4 s	67 s	16 GB
1440×2560	4.5 s	127 s	16 GB
2048×2048	6.7 s	187 s	16 GB
3104×1312	7.6 s	213 s	16 GB

mx.compile gave no measurable speedup: the inference loop is bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps.
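The totals in the table are close to per-step latency multiplied by the 28 denoising steps, which makes it easy to budget a batch. The helper below is a hypothetical sketch built from the table's BF16 figures; it ignores model load time and any other overhead, which the table does not report.

```python
STEPS = 28  # Dev variant step count

# Per-step latency in seconds, copied from the table above (BF16, 64 GB Mac Studio).
PER_STEP_SECONDS = {
    (1024, 1024): 2.4,
    (1440, 2560): 4.5,
    (2048, 2048): 6.7,
    (3104, 1312): 7.6,
}

def estimate_seconds(width: int, height: int, images: int = 1,
                     steps: int = STEPS) -> float:
    """Estimate denoising time for `images` generations at a measured size."""
    return PER_STEP_SECONDS[(width, height)] * steps * images

for size in PER_STEP_SECONDS:
    print(size, f"{estimate_seconds(*size):.0f} s per image")
print(f"10 images at 2048x2048: {estimate_seconds(2048, 2048, images=10) / 60:.1f} min")
```

Small differences from the table's totals come from the per-step figures being rounded to one decimal place.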

Status

✅ Text-to-image: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac
✅ Instruction edit (K=1 ref): working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
✅ Multi-reference subject personalization (K=2-3 refs): supported by the upstream architecture and our port; same --ref-images flag with multiple paths
✅ Native MLX — no PyTorch, no CUDA, no flash-attn at inference time
⚠ Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path works at Q6 and Q8, subject to the non-square grid artifacts noted under Variants.

Acknowledgements

HiDream-ai for the original HiDream-O1-Image model + MIT license
Blaizzy/mlx-vlm for the Qwen3-VL MLX backbone (this port reuses their vision tower + decoder layers + mrope-3D wholesale)
Apple ml-explore/mlx for the MLX framework
The Civitai community's HiDream prompt-engineering guide

Citation

If you use this in research, cite the upstream model:

@misc{hidream-o1-image,
  author = {HiDream-ai},
  title = {HiDream-O1-Image: Pixel-Level Unified Transformer},
  year = {2026},
  url = {https://github.com/HiDream-ai/HiDream-O1-Image}
}

License

MIT — see LICENSE.