mlx-community/HiDream-O1-Image-Dev-mlx-bf16

@@ -21,8 +21,19 @@ authors:
 > Ported by **[Mrbizarro](https://huggingface.co/Mrbizarro)** · MIT licensed · published to mlx-community
 A native MLX port of [HiDream-ai/HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) for fast local image generation on Apple Silicon Macs. **No PyTorch, no CUDA, no flash-attn required at inference time.**
 HiDream-O1 is an 8B Qwen3-VL-based **unified pixel-patch transformer** — it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
 This port:
@@ -151,6 +162,22 @@ uv pip install -r requirements.txt
   --prompt "..." \
   --width 1440 --height 2560 \
   --output sample_outputs/portrait.png
 ```
 ### Trained resolutions
@@ -209,10 +236,11 @@ hidream-o1-mlx/
 ## Status
-- ✅ Text-to-image: production-quality, BF16 default
-- ✅ Native MLX, no PyTorch / CUDA / flash-attn at inference time
-- ⚠ Edit / multi-reference: scaffolding present (`--ref-images` flag) but produces degenerate output — needs debugging. Refs through other engines (e.g. `mflux qwen-edit`) work correctly.
-- ❌ Multi-reference subject personalization: same as above
 ## Acknowledgements

 > Ported by **[Mrbizarro](https://huggingface.co/Mrbizarro)** · MIT licensed · published to mlx-community
+## 🎛️ Run it one-click in **[Phosphene](https://github.com/mrbizarro/phosphene)**
+Phosphene is a free, local generative-video panel for Apple Silicon Macs (M1 or later). It ships with HiDream-O1 wired into its Image Studio: pick **"HiDream-O1-Image-Dev BF16"** from the engine dropdown to get native edit and multi-reference support out of the box, with no conda, Python tinkering, or separate venv setup. To install it, first **[install Pinokio](https://pinokio.computer)**, then install [Phosphene](https://github.com/mrbizarro/phosphene) from within Pinokio.
+---
 A native MLX port of [HiDream-ai/HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) for fast local image generation on Apple Silicon Macs. **No PyTorch, no CUDA, no flash-attn required at inference time.**
+**Capabilities** (all native to HiDream-O1, all working in this port):
+- **Text-to-image** at 1024×1024, 2048×2048, and the trained non-square resolutions
+- **Instruction-based image edit** with 1 reference image (e.g. *"change the chef's white jacket to red"*, which preserves the scene, pose, and identity)
+- **Multi-reference subject personalization** with 2-3 reference images (composes multiple subjects in a new scene)
 HiDream-O1 is an 8B Qwen3-VL-based **unified pixel-patch transformer** — it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
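To get a feel for what "raw 32×32 RGB patches" means at the resolutions used in this README, the sketch below counts the patches in an image. It assumes the image is tiled into non-overlapping 32×32 patches and that both dimensions are multiples of 32. The helper name `patch_grid` is illustrative and not part of this port; how the port actually handles other sizes is not covered here.

```python
# Sketch: count non-overlapping 32x32 patches for a given image size.
# Assumption: plain tiling with no padding or overlap.
PATCH = 32  # patch edge in pixels, from the model description


def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int, int]:
    """Return (columns, rows, total patches) for a width x height image."""
    if width % patch or height % patch:
        raise ValueError(f"{width}x{height} is not a multiple of {patch}")
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows


for w, h in [(1024, 1024), (2048, 2048), (1440, 2560)]:
    cols, rows, total = patch_grid(w, h)
    print(f"{w}x{h}: {cols} x {rows} = {total} patches")
```

Under these assumptions a 1024×1024 image is a 32×32 grid (1024 patches), and the 1440×2560 portrait example is a 45×80 grid (3600 patches).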
 This port:
   --prompt "..." \
   --width 1440 --height 2560 \
   --output sample_outputs/portrait.png
+# Instruction-based edit (one ref image)
+.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
+  --model-path mlx_models/hidream-o1-dev-bf16 \
+  --prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
+  --output sample_outputs/edit_red_jacket.png \
+  --ref-images /path/to/chef.jpg \
+  --seed 42
+# Multi-reference subject personalization (2-3 refs)
+.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
+  --model-path mlx_models/hidream-o1-dev-bf16 \
+  --prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
+  --output sample_outputs/multi_ref.png \
+  --ref-images /path/to/person.jpg /path/to/place.jpg \
+  --seed 42
 ```
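The edit and multi-reference commands above differ only in how many paths follow `--ref-images`, so they can be assembled by one small wrapper. This is a minimal sketch, not part of the port: `build_command` is a hypothetical helper, the 3-image limit is taken from the "2-3 refs" wording in this README rather than from the script itself, and only flags shown in the examples above are used.

```python
import shlex

# Script path and default model path copied from the examples above.
SCRIPT = "scripts/hidream_o1/generate_hidream_o1_mlx.py"
MAX_REFS = 3  # assumption: the README describes 1 ref (edit) or 2-3 refs


def build_command(prompt, output, ref_images=(), seed=None,
                  model_path="mlx_models/hidream-o1-dev-bf16",
                  python=".venv/bin/python"):
    """Hypothetical helper: assemble the CLI call as an argument list."""
    refs = [str(p) for p in ref_images]
    if len(refs) > MAX_REFS:
        raise ValueError(f"expected at most {MAX_REFS} reference images, got {len(refs)}")
    cmd = [python, SCRIPT,
           "--model-path", model_path,
           "--prompt", prompt,
           "--output", output]
    if refs:
        # One flag followed by one or more paths, as in the examples above.
        cmd += ["--ref-images", *refs]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_command(
    "the person from reference 1 standing in the location from reference 2",
    "sample_outputs/multi_ref.png",
    ref_images=["/path/to/person.jpg", "/path/to/place.jpg"],
    seed=42,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Using a list rather than a single string avoids shell-quoting problems with prompts that contain apostrophes.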
 ### Trained resolutions
 ## Status
+- ✅ **Text-to-image**: production-quality, BF16 default, about 67 s per 1024×1024 image on a 64 GB Mac
+- ✅ **Instruction edit (K=1 ref)**: working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
+- ✅ **Multi-reference subject personalization (K=2-3 refs)**: supported by the upstream architecture and this port; uses the same `--ref-images` flag with multiple paths
+- ✅ Native MLX: no PyTorch, no CUDA, no flash-attn at inference time
+- ⚠ Edit requires BF16: Q6/Q8 quantization breaks attention against the reference features and produces degenerate output. The text-to-image path works at all quantization levels.
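The BF16 caveat is easy to trip over when scripting batches, so a wrapper may want to fail early rather than produce degenerate images. The sketch below is one way to do that. The function `check_precision` and the precision labels (`"bf16"`, `"q6"`, `"q8"`) are hypothetical names chosen for illustration; the port itself is not known to expose such a check.

```python
def check_precision(precision: str, num_refs: int) -> None:
    """Raise if reference images are combined with a non-BF16 model.

    Assumption: `precision` is a caller-supplied label such as "bf16",
    "q6" or "q8"; it is not read from the model files.
    """
    if num_refs < 0:
        raise ValueError("num_refs must be non-negative")
    if num_refs > 0 and precision.lower() != "bf16":
        raise ValueError(
            f"edit / multi-reference needs BF16 weights, got {precision!r}; "
            "quantized weights are only expected to work for text-to-image"
        )


check_precision("q8", num_refs=0)    # text-to-image on quantized weights: accepted
check_precision("bf16", num_refs=2)  # multi-reference on BF16: accepted
```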
 ## Acknowledgements