metadata
license: mit
base_model: HiDream-ai/HiDream-O1-Image-Dev
tags:
  - mlx
  - mlx-vlm
  - hidream
  - text-to-image
  - apple-silicon
  - bf16
language:
  - en
pipeline_tag: text-to-image
library_name: mlx
inference: false
authors:
  - Mrbizarro

HiDream-O1-Image-Dev: MLX port for Apple Silicon

Ported by Mrbizarro · MIT licensed · published to mlx-community

🎛️ Run it one-click in Phosphene

Phosphene is a free local generative-video panel for Apple Silicon (Mac, M1+). It ships with HiDream-O1 wired into its Image Studio: pick "HiDream-O1-Image-Dev BF16" from the engine dropdown and you have native edit + multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. Install Pinokio, then in Pinokio install Phosphene.


A native MLX port of HiDream-ai/HiDream-O1-Image-Dev for fast local image generation on Apple Silicon Macs. No PyTorch, no CUDA, no flash-attn required at inference time.

Capabilities (all native to HiDream-O1, all working in this port):

  • Text-to-image at 1024×1024 / 2048×2048 / non-square trained dims
  • Instruction-based image edit with 1 reference image (e.g. "change the chef's white jacket to red"; preserves scene, pose, identity)
  • Multi-reference subject personalization with 2-3 reference images (compose multiple subjects in a new scene)

HiDream-O1 is an 8B Qwen3-VL-based unified pixel-patch transformer: it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
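To make the 32×32 patch size concrete, here is a minimal sketch of how many patches an image of a given size breaks into. The function names are hypothetical and not part of this repo, and treating each patch as one image token is an assumption about the sequence layout rather than something stated above.

```python
PATCH = 32  # patch edge in pixels, as stated above


def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Return (columns, rows) of patches for an image of the given size."""
    if width % patch or height % patch:
        raise ValueError(f"{width}x{height} is not a multiple of {patch}")
    return width // patch, height // patch


def patch_count(width: int, height: int, patch: int = PATCH) -> int:
    """Number of patches covering the image (assumed: one token per patch)."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows


# Each RGB patch carries 32 * 32 * 3 raw values.
VALUES_PER_PATCH = PATCH * PATCH * 3

if __name__ == "__main__":
    for w, h in [(1024, 1024), (2048, 2048), (1440, 2560)]:
        print(f"{w}x{h}: grid {patch_grid(w, h)}, {patch_count(w, h)} patches")
```

Every resolution mentioned in this README is a multiple of 32 on both axes, so the grid is always whole.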

This port:

  • Reuses mlx-vlm's Qwen3-VL backbone (vision tower, decoder layers, mrope-3D)
  • Adds the three diffusion-side custom heads (t_embedder1, x_embedder, final_layer2)
  • Ports the FlashFlowMatchEulerDiscreteScheduler and the unified-token-sequence builder
  • Ships BF16 weights (no quantization; see "Why BF16 is the safe default" below)
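For readers unfamiliar with flow-match sampling, the sketch below shows a plain Euler integration loop of the kind such a scheduler drives. It is not the ported FlashFlowMatchEulerDiscreteScheduler: the linear sigma schedule, the function names, and the toy velocity function are all assumptions made for illustration, and only the 28-step count comes from this README.

```python
def linear_sigmas(num_steps: int) -> list[float]:
    """num_steps + 1 noise levels from 1.0 (pure noise) down to 0.0 (clean).

    Assumption: a plain linear schedule; the real scheduler may shift it.
    """
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def euler_sample(noise, velocity_fn, num_steps=28):
    """Integrate dx/dsigma = v(x, sigma) from sigma=1 to sigma=0 with Euler steps."""
    x = list(noise)
    sigmas = linear_sigmas(num_steps)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)  # in a real pipeline this is the model forward pass
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: on a straight noise-to-data path the exact velocity is (noise - data),
# so Euler integration should land on the data point.
target = [0.25, -0.5, 1.0]
noise = [1.5, 0.3, -0.7]


def toy_velocity(x, sigma):
    return [n - t for n, t in zip(noise, target)]


result = euler_sample(noise, toy_velocity)
print(result)
```

In the real pipeline the velocity comes from the transformer at every step, which is why step count translates directly into generation time.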

Hero samples

All generated by the included generator script on a 64 GB Mac Studio. Click any image to open full-resolution.

  • Construction worker on a rainy rooftop, Kodak Tri-X B&W. 2048×2048, BF16, 213s.
  • Elderly Japanese tea master holding a ceramic cup. 1024×1024, Q6 (showcase), 36s.
  • Tropical beach with turquoise water and palms. 1024×1024, Q8, 67s.
  • Candid morning portrait, woman with coffee + toast, soft window light. 1440×2560, BF16, 127s.
  • Astronaut in space-station corridor, anamorphic lens flare. 2560×1440, BF16, 187s.
  • Snow-capped mountain peak at sunset. 2048×2048, Q4 (early), 236s.
  • Alice in cyberpunk, neon Cheshire cat hologram. 2048×2048, Q8, 276s.
  • Fitness influencer mid-deadlift in industrial gym. 1440×2560, BF16, 127s.

More: sample_outputs/hero/.

Variants

| Variant | Repo | Backbone size | RAM (1024) | Quality |
|---|---|---|---|---|
| BF16 (this repo) | mlx-community/HiDream-O1-Image-Dev-mlx-bf16 | 17.5 GB | 16 GB | ✅ Clean across all trained dims |
| Q8 | mlx-community/HiDream-O1-Image-Dev-mlx-q8 | 10 GB | 11.5 GB | ⚠ Clean at square dims, grid at non-square |
| Q6 | mlx-community/HiDream-O1-Image-Dev-mlx-q6 | 8 GB | 8.5 GB | ⚠ Clean at square dims, grid at non-square |

Q4 was tested and rejected: brightness collapses and every image comes out dark.

Why BF16 is the safe default

Per-group dequantization rounding (Q6/Q8) compounds across the 36 decoder layers and shows as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at non-square trained dimensions like 1440×2560 or 3104×1312. BF16 matches the upstream's torch_dtype=torch.float32 + autocast(bfloat16) precision and is the only variant that stays clean across all trained dimensions.

If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, Q6 is half the size and 2× faster, with no quality loss at those dims. Use Q6 on a 16 GB Mac, BF16 on 32 GB+.
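The per-group rounding described above can be illustrated with a small, self-contained sketch. It uses a generic min/max affine quantizer over groups of 64 values, which is an assumption for illustration and not necessarily the exact scheme used by the conversion script; the weights are synthetic random numbers, not model weights.

```python
import random


def quantize_dequantize(values, bits, group_size=64):
    """Round each group to 2**bits levels between its own min and max, then restore."""
    levels = (1 << bits) - 1
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against a constant group
        restored.extend(lo + round((v - lo) / scale) * scale for v in group)
    return restored


def mean_abs_error(values, bits):
    """Average rounding error introduced by one quantize/dequantize round trip."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


rng = random.Random(0)  # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights
errors = {bits: mean_abs_error(weights, bits) for bits in (4, 6, 8)}

for bits, err in errors.items():
    print(f"Q{bits}: mean abs rounding error {err:.2e}")
```

The error per weight shrinks as bits increase but never reaches zero, and this sketch only covers a single round trip; the compounding across layers described above is not modelled here.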

Install

Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio.

Quick start (download pre-converted weights, recommended)

# Download the repo (code, docs, samples, model files)
hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx
cd hidream-o1-mlx

# Set up the venv
uv venv --python 3.11
uv pip install -r requirements.txt

# Generate (model files are at the repo root, so pass --model-path .)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path . \
  --prompt "your prompt here" \
  --output out.png

Or convert from upstream weights yourself

git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16
cd HiDream-O1-Image-Dev-mlx-bf16
uv venv --python 3.11
uv pip install -r requirements.txt

# Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk)
.venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \
  --hf-source HiDream-ai/HiDream-O1-Image-Dev \
  --out-dir mlx_models/hidream-o1-dev-bf16 \
  --bits 16

Usage

# Single image, default 1024×1024 BF16
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "your prompt here" \
  --output sample_outputs/whatever.png \
  --seed 42

# Higher resolution (2048×2048 = upstream default)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 2048 --height 2048 \
  --output sample_outputs/big.png

# Vertical / cinema (auto-snaps to nearest trained ratio)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 1440 --height 2560 \
  --output sample_outputs/portrait.png

# Instruction-based edit (one ref image)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
  --output sample_outputs/edit_red_jacket.png \
  --ref-images /path/to/chef.jpg \
  --seed 42

# Multi-reference subject personalization (2-3 refs)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
  --output sample_outputs/multi_ref.png \
  --ref-images /path/to/person.jpg /path/to/place.jpg \
  --seed 42
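If you want to drive the generator from Python, for example to sweep seeds, a thin wrapper around the same command line is enough. The sketch below only uses flags that appear in the commands above; the `build_command` helper and the output file names are hypothetical, and the actual `subprocess.run` call is left commented out so the snippet is safe to run anywhere.

```python
import shlex
import sys

GENERATOR = "scripts/hidream_o1/generate_hidream_o1_mlx.py"


def build_command(prompt, output, model_path="mlx_models/hidream-o1-dev-bf16",
                  width=None, height=None, seed=None, ref_images=()):
    """Assemble the argv list for one generator run (flags as shown above)."""
    cmd = [sys.executable, GENERATOR,
           "--model-path", model_path,
           "--prompt", prompt,
           "--output", output]
    if width is not None and height is not None:
        cmd += ["--width", str(width), "--height", str(height)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if ref_images:
        cmd += ["--ref-images", *ref_images]
    return cmd


# One command per seed; run this with the venv's Python so sys.executable matches.
commands = [
    build_command("your prompt here", f"sample_outputs/seed_{seed}.png", seed=seed)
    for seed in (1, 2, 3)
]

for cmd in commands:
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to generate
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain apostrophes or commas.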

Trained resolutions

HiDream-O1 was trained on a fixed list of resolutions. The generator auto-snaps to the closest. Off-spec dims produce visible patch artifacts. The trained list:

2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560,
2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304
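The snapping behaviour can be approximated as a nearest-aspect-ratio lookup over the list above. This is a sketch under assumptions: `snap_to_trained` is a hypothetical name, comparing log aspect ratios is one reasonable choice of distance, and the real generator may use a different rule (for instance, how it handles smaller sizes such as 1024×1024 is not modelled here).

```python
import math

# Trained (width, height) pairs, copied from the list above.
TRAINED = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104), (2304, 1792),
    (1792, 2304),
]


def snap_to_trained(width: int, height: int) -> tuple[int, int]:
    """Return the trained resolution whose aspect ratio is closest to the request."""
    target = math.log(width / height)
    return min(TRAINED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    for request in [(1920, 1080), (1200, 900), (1000, 1000)]:
        print(request, "->", snap_to_trained(*request))
```

Comparing ratios in log space treats "twice as wide" and "twice as tall" symmetrically, so landscape and portrait requests snap with the same tolerance.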

Prompt tips for realism

HiDream is responsive to camera/film terminology. To avoid the AI-glossy look:

  • Lead with masterpiece, best quality (community-found responder phrase)
  • Subject + Actions → Setting → Style → Details ordering
  • Specify equipment: Leica M6 with Kodak Tri-X 400, Pentax K1000 + Cinestill 800T, Hasselblad H6D medium format
  • Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen
  • Spell out skin imperfection: "natural pores", "faint laugh lines", "weathered hands", "no retouching"
  • Avoid "stunning", "perfect", "beautiful": they push toward AI-glamour aesthetics

The Dev model uses guidance_scale=0.0, so negative prompts have no effect; push positive prompts harder instead.
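The tips above can be turned into a small prompt-assembly helper. This is only an illustrative sketch: `build_prompt` and `glossy_words_in` are hypothetical helpers, the example subject and setting are made up, and joining the parts with commas is an assumption about formatting rather than something the model requires.

```python
import re

GLOSSY = re.compile(r"\b(?:stunning|perfect|beautiful)\b")


def build_prompt(subject, setting, style, details=(), lead="masterpiece, best quality"):
    """Join parts in the order: lead phrase, subject + actions, setting, style, details."""
    parts = [lead, subject, setting, style, *details]
    return ", ".join(p.strip() for p in parts if p and p.strip())


def glossy_words_in(prompt):
    """Return any of the discouraged glamour words found as whole words."""
    return GLOSSY.findall(prompt.lower())


prompt = build_prompt(
    subject="elderly fisherman mending a net",   # made-up example subject
    setting="harbour at dawn",
    style="Leica M6 with Kodak Tri-X 400",
    details=["natural pores", "weathered hands", "no retouching"],
)
print(prompt)
print("glossy words:", glossy_words_in(prompt))
```

The whole-word match means a phrase like "skin imperfection" is not flagged just because it contains "perfect".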

What's in this repo

hidream-o1-mlx/
├── README.md                                 (this file)
├── LICENSE                                   (MIT)
├── requirements.txt                          (mlx-vlm 0.5.0, transformers 5.8+, deps)
├── scripts/hidream_o1/
│   ├── convert_hidream_o1_to_mlx.py          (HF → MLX, BF16 / Q4 / Q6 / Q8)
│   ├── generate_hidream_o1_mlx.py            (T2I generator + experimental edit/multi-ref)
│   ├── hidream_model.py                      (custom heads + forward_generation)
│   ├── pipeline_helpers.py                   (T2I sample, mrope, mask, patchify)
│   └── flow_match.py                         (FlashFlowMatchScheduler in MLX)
├── docs/
│   ├── EVALUATION.md                         (perf + quality findings, A/B vs mflux)
│   ├── HIDREAM_O1_MLX_PORT_REPORT.md         (architecture + weight conversion details)
│   └── PHOSPHENE_INTEGRATION_PLAN.md         (how it slots into a host app)
├── sample_outputs/                           (gallery)
└── mlx_models/                               (where converted weights land)

Performance

| Resolution | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|
| 1024×1024 | 2.4 s | 67 s | 16 GB |
| 1440×2560 | 4.5 s | 127 s | 16 GB |
| 2048×2048 | 6.7 s | 187 s | 16 GB |
| 3104×1312 | 7.6 s | 213 s | 16 GB |

mx.compile gives 0% speedup: the inference loop is bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps.
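The totals in the table are essentially the per-step time multiplied by the 28 denoising steps. The sketch below reproduces that arithmetic from the table's own figures; the helper name is hypothetical, and the numbers apply only to the 64 GB Mac Studio they were measured on.

```python
STEPS = 28  # Dev model step count

# Per-step seconds, copied from the table above.
PER_STEP_SECONDS = {
    (1024, 1024): 2.4,
    (1440, 2560): 4.5,
    (2048, 2048): 6.7,
    (3104, 1312): 7.6,
}


def estimated_total_seconds(width: int, height: int, steps: int = STEPS) -> float:
    """Per-step time times step count; ignores model load and image save time."""
    return PER_STEP_SECONDS[(width, height)] * steps


if __name__ == "__main__":
    for (w, h), per_step in PER_STEP_SECONDS.items():
        print(f"{w}x{h}: {per_step} s/step -> ~{estimated_total_seconds(w, h):.0f} s")
```

The products land within about a second of the reported totals, which is consistent with the denoising loop dominating wall-clock time.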

Status

  • ✅ Text-to-image: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac
  • ✅ Instruction edit (K=1 ref): working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
  • ✅ Multi-reference subject personalization (K=2-3 refs): supported by the upstream architecture and our port; same --ref-images flag with multiple paths
  • ✅ Native MLX: no PyTorch, no CUDA, no flash-attn at inference time
  • ⚠ Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path is fine at all quants.

Acknowledgements

Citation

If you use this in research, cite the upstream model:

@misc{hidream-o1-image,
  author = {HiDream-ai},
  title = {HiDream-O1-Image: Pixel-Level Unified Transformer},
  year = {2026},
  url = {https://github.com/HiDream-ai/HiDream-O1-Image}
}

License

MIT. See LICENSE.