	---
	license: mit
	base_model: HiDream-ai/HiDream-O1-Image-Dev
	tags:
	- mlx
	- mlx-vlm
	- hidream
	- text-to-image
	- apple-silicon
	- bf16
	language:
	- en
	pipeline_tag: text-to-image
	library_name: mlx
	inference: false
	authors:
	- Mrbizarro
	---

	# HiDream-O1-Image-Dev — MLX port for Apple Silicon

	> Ported by [Mrbizarro](https://huggingface.co/Mrbizarro) · MIT licensed · published to mlx-community

	## 🎛️ Run it one-click in [Phosphene](https://github.com/mrbizarro/phosphene)

	Phosphene is a free local generative-video panel for Apple Silicon (Mac, M1+). It ships with HiDream-O1 wired into its Image Studio — pick "HiDream-O1-Image-Dev BF16" from the engine dropdown and you have native edit + multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. [Install Pinokio](https://pinokio.computer), then in Pinokio install [Phosphene](https://github.com/mrbizarro/phosphene).

	---

	A native MLX port of [HiDream-ai/HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) for fast local image generation on Apple Silicon Macs. No PyTorch, no CUDA, no flash-attn required at inference time.

	Capabilities (all native to HiDream-O1, all working in this port):
	- Text-to-image at 1024×1024 / 2048×2048 / non-square trained dims
	- Instruction-based image edit with 1 reference image (e.g. "change the chef's white jacket to red" — preserves scene, pose, identity)
	- Multi-reference subject personalization with 2-3 reference images (compose multiple subjects in a new scene)

	HiDream-O1 is an 8B Qwen3-VL-based unified pixel-patch transformer — it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
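To make the patch arithmetic concrete, here is a minimal sketch (not code from this repo) that counts the 32×32 patches for a given image size. The helper name `patch_grid` is hypothetical, and the real token sequence presumably also carries text and reference-image tokens, which this ignores.

```python
PATCH = 32  # patch edge in pixels, taken from the description above


def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int, int]:
    """Return (columns, rows, total patches) for an image size."""
    if width % patch or height % patch:
        raise ValueError(f"{width}x{height} is not a multiple of {patch}")
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows


# Each patch is predicted as raw RGB values: 32 * 32 pixels * 3 channels.
VALUES_PER_PATCH = PATCH * PATCH * 3

for size in [(1024, 1024), (2048, 2048), (1440, 2560)]:
    cols, rows, total = patch_grid(*size)
    print(f"{size[0]}x{size[1]}: {cols}x{rows} grid, {total} patches")
```

Going from 1024×1024 to 2048×2048 quadruples the patch count (1024 to 4096), which is one reason the per-step time grows with resolution.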

	This port:
	- Reuses [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm)'s Qwen3-VL backbone (vision tower, decoder layers, mrope-3D)
	- Adds the three diffusion-side custom heads (`t_embedder1`, `x_embedder`, `final_layer2`)
	- Ports the `FlashFlowMatchEulerDiscreteScheduler` and the unified-token-sequence builder
	- Ships BF16 weights (no quantization — see "Why BF16" below)

	## Hero samples

	All generated by the included generator script on a 64 GB Mac Studio. Click any image to open full-resolution.

	<table>
	<tr>
	<td><a href="sample_outputs/hero/04_construction_worker.png"><img src="sample_outputs/hero/04_construction_worker.png" width="350"/></a></td>
	<td><a href="sample_outputs/hero/01_tea_master.png"><img src="sample_outputs/hero/01_tea_master.png" width="350"/></a></td>
	</tr>
	<tr>
	<td>Construction worker on a rainy rooftop, Kodak Tri-X B&W. 2048×2048, BF16, 213s.</td>
	<td>Elderly Japanese tea master holding a ceramic cup. 1024×1024, Q6 (showcase), 36s.</td>
	</tr>

	<tr>
	<td><a href="sample_outputs/hero/02_tropical_beach.png"><img src="sample_outputs/hero/02_tropical_beach.png" width="350"/></a></td>
	<td><a href="sample_outputs/hero/07_kitchen_morning.png"><img src="sample_outputs/hero/07_kitchen_morning.png" width="350"/></a></td>
	</tr>
	<tr>
	<td>Tropical beach with turquoise water and palms. 1024×1024, Q8, 67s.</td>
	<td>Candid morning portrait, woman with coffee + toast, soft window light. 1440×2560, BF16, 127s.</td>
	</tr>

	<tr>
	<td><a href="sample_outputs/hero/03_astronaut.png"><img src="sample_outputs/hero/03_astronaut.png" width="350"/></a></td>
	<td><a href="sample_outputs/hero/05_mountain_peak.png"><img src="sample_outputs/hero/05_mountain_peak.png" width="350"/></a></td>
	</tr>
	<tr>
	<td>Astronaut in space-station corridor, anamorphic lens flare. 2560×1440, BF16, 187s.</td>
	<td>Snow-capped mountain peak at sunset. 2048×2048, Q4 (early), 236s.</td>
	</tr>

	<tr>
	<td><a href="sample_outputs/hero/06_alice_cyberpunk.png"><img src="sample_outputs/hero/06_alice_cyberpunk.png" width="350"/></a></td>
	<td><a href="sample_outputs/hero/08_fitness_BF16.png"><img src="sample_outputs/hero/08_fitness_BF16.png" width="350"/></a></td>
	</tr>
	<tr>
	<td>Alice in cyberpunk, neon Cheshire cat hologram. 2048×2048, Q8, 276s.</td>
	<td>Fitness influencer mid-deadlift in industrial gym. 1440×2560, BF16, 127s.</td>
	</tr>
	</table>

	More: [`sample_outputs/hero/`](sample_outputs/hero/).

	## Variants

	| Variant | Repo | Backbone size | RAM (1024) | Quality |
	|---|---|---|---|---|
	| BF16 (this repo) | `mlx-community/HiDream-O1-Image-Dev-mlx-bf16` | 17.5 GB | 16 GB | ✅ Clean across all trained dims |
	| Q8 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q8`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q8) | 10 GB | 11.5 GB | ⚠ Clean at square dims, grid at non-square |
	| Q6 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q6`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q6) | 8 GB | 8.5 GB | ⚠ Clean at square dims, grid at non-square |

	Q4 was tested and rejected — brightness collapses, every image ships dark.

	### Why BF16 is the safe default

	Per-group dequantization rounding (Q6/Q8) compounds across the 36 decoder layers and shows up as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at non-square trained dimensions such as 1440×2560 or 3104×1312. BF16 matches the precision of the upstream setup (`torch_dtype=torch.float32` with `autocast(bfloat16)`) and is the only variant that stays clean across all trained dimensions.

	If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, Q6 is roughly half the size of BF16 (8 GB vs 17.5 GB) and about 2× faster, with no quality loss at those dims. Use Q6 on a 16 GB Mac, BF16 on 32 GB+.
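As a rough illustration of why lower bit widths round harder, the sketch below runs a simplified per-group affine quantize/dequantize round trip in pure Python. It is not MLX's kernel: the group size of 64, the synthetic Gaussian weights, and the helper names are assumptions chosen for the demo. It measures rounding error for a single weight vector only, not how that error compounds across 36 layers.

```python
import random


def quantize_group(values, bits):
    """Affine-quantize one group to `bits` bits, then dequantize it again."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1                      # number of steps between lo and hi
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_roundtrip_error(weights, bits, group_size=64):
    """Worst absolute error after quantizing each group independently."""
    worst = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored = quantize_group(group, bits)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


rng = random.Random(0)                            # fixed seed for repeatability
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights
errors = {bits: max_roundtrip_error(weights, bits) for bits in (4, 6, 8)}

for bits, err in errors.items():
    print(f"{bits}-bit worst-case rounding error: {err:.6f}")
```

In this toy setting the worst-case error shrinks as the bit width grows, which matches the ordering of the findings here: Q4 was rejected outright, Q6/Q8 are clean only at square dims, and BF16 avoids group quantization altogether.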

	## Install

	Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio.

	### Quick start (download pre-converted weights — recommended)

	```bash
	# Download the repo (code, docs, samples, and the pre-converted weights)
	hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx
	cd hidream-o1-mlx

	# Set up the venv
	uv venv --python 3.11
	uv pip install -r requirements.txt

	# Generate (model files are at the repo root — pass --model-path .)
	.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
	--model-path . \
	--prompt "your prompt here" \
	--output out.png
	```

	### Or convert from upstream weights yourself

	```bash
	git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16
	cd HiDream-O1-Image-Dev-mlx-bf16
	uv venv --python 3.11
	uv pip install -r requirements.txt

	# Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk)
	.venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \
	--hf-source HiDream-ai/HiDream-O1-Image-Dev \
	--out-dir mlx_models/hidream-o1-dev-bf16 \
	--bits 16
	```

	## Usage

	```bash
	# Single image, default 1024×1024 BF16
	.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
	--model-path mlx_models/hidream-o1-dev-bf16 \
	--prompt "your prompt here" \
	--output sample_outputs/whatever.png \
	--seed 42

	# Higher resolution (2048×2048 = upstream default)
	.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
	--model-path mlx_models/hidream-o1-dev-bf16 \
	--prompt "..." \
	--width 2048 --height 2048 \
	--output sample_outputs/big.png

	# Vertical / cinema (auto-snaps to nearest trained ratio)
	.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
	--model-path mlx_models/hidream-o1-dev-bf16 \
	--prompt "..." \
	--width 1440 --height 2560 \
	--output sample_outputs/portrait.png

	# Instruction-based edit (one ref image)
	.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
	--model-path mlx_models/hidream-o1-dev-bf16 \
	--prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
	--output sample_outputs/edit_red_jacket.png \
	--ref-images /path/to/chef.jpg \
	--seed 42

	# Multi-reference subject personalization (2-3 refs)
	.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
	--model-path mlx_models/hidream-o1-dev-bf16 \
	--prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
	--output sample_outputs/multi_ref.png \
	--ref-images /path/to/person.jpg /path/to/place.jpg \
	--seed 42
	```
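If you drive the generator from another Python program, one option is to assemble the same command line programmatically. The sketch below only uses flags that appear in the examples above. `build_command` is a hypothetical helper, not part of the repo, and the cap of three reference images simply mirrors the documented 2-3 range.

```python
import shlex

GENERATOR = "scripts/hidream_o1/generate_hidream_o1_mlx.py"


def build_command(prompt, output, model_path="mlx_models/hidream-o1-dev-bf16",
                  width=None, height=None, seed=None, ref_images=()):
    """Build the argv list for one generator run (hypothetical wrapper)."""
    if len(ref_images) > 3:
        raise ValueError("this README documents at most 3 reference images")
    if (width is None) != (height is None):
        raise ValueError("pass both width and height, or neither")
    cmd = [".venv/bin/python", GENERATOR,
           "--model-path", model_path,
           "--prompt", prompt,
           "--output", output]
    if width is not None:
        cmd += ["--width", str(width), "--height", str(height)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if ref_images:
        cmd += ["--ref-images", *ref_images]
    return cmd


cmd = build_command("a lighthouse at dusk", "sample_outputs/lighthouse.png",
                    width=2048, height=2048, seed=42)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain apostrophes or commas.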

	### Trained resolutions

	HiDream-O1 was trained on a fixed list of resolutions. The generator auto-snaps a requested size to the closest trained aspect ratio; off-spec dimensions produce visible patch artifacts. The trained list:

	```
	2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560,
	2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304
	```
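The snapping rule used by the generator is not spelled out here, so the following is only a sketch of one plausible approach: pick the trained entry whose aspect ratio is closest in log space. The function name `snap_to_trained` is hypothetical, and unlike the real script this version always returns the full-size trained dimensions rather than preserving the requested scale.

```python
import math

# Trained resolutions, copied from the list above as (width, height).
TRAINED = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104), (2304, 1792),
    (1792, 2304),
]


def snap_to_trained(width: int, height: int) -> tuple[int, int]:
    """Return the trained resolution with the nearest aspect ratio."""
    target = math.log(width / height)
    # Comparing log ratios treats 2:1 and 1:2 as equally far from 1:1.
    return min(TRAINED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


print(snap_to_trained(1920, 1080))  # 16:9 request -> (2560, 1440)
print(snap_to_trained(1080, 1920))  # 9:16 request -> (1440, 2560)
print(snap_to_trained(1000, 1000))  # square request -> (2048, 2048)
```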

	## Prompt tips for realism

	HiDream is responsive to camera/film terminology. To avoid the AI-glossy look:

	- Lead with `masterpiece, best quality` (a phrase the community found the model responds to)
	- Order the prompt as Subject + Actions → Setting → Style → Details
	- Specify equipment: `Leica M6 with Kodak Tri-X 400`, `Pentax K1000 + Cinestill 800T`, `Hasselblad H6D medium format`
	- Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen
	- Spell out skin imperfection: "natural pores", "faint laugh lines", "weathered hands", "no retouching"
	- Avoid "stunning", "perfect", "beautiful" — they push toward AI-glamour aesthetics

	The Dev model uses `guidance_scale=0.0` so negative prompts have no effect — push positive prompts harder instead.
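The tips above can be sketched as a small prompt builder. This is an illustrative helper, not part of the repo: `build_prompt` and its parameters are hypothetical names, and the word check only covers the three glamour words listed above.

```python
import re

LEAD = "masterpiece, best quality"
GLAMOUR_WORDS = {"stunning", "perfect", "beautiful"}


def build_prompt(subject, setting, style, details, lead=LEAD):
    """Join the parts in the recommended order and flag glamour words."""
    parts = [lead, subject, setting, style, details]
    prompt = ", ".join(p.strip() for p in parts if p and p.strip())
    # Match whole words only, so "imperfection" does not trip on "perfect".
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return prompt, sorted(words & GLAMOUR_WORDS)


prompt, flagged = build_prompt(
    subject="elderly fisherman mending a net",
    setting="harbor at dawn",
    style="Leica M6 with Kodak Tri-X 400",
    details="natural pores, weathered hands, no retouching",
)
print(prompt)
print("flagged:", flagged)
```

An empty `flagged` list means none of the listed glamour words slipped into the prompt.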

	## What's in this repo

	```
	hidream-o1-mlx/
	├── README.md (this file)
	├── LICENSE (MIT)
	├── requirements.txt (mlx-vlm 0.5.0, transformers 5.8+, deps)
	├── scripts/hidream_o1/
	│ ├── convert_hidream_o1_to_mlx.py (HF → MLX, BF16 / Q4 / Q6 / Q8)
	│ ├── generate_hidream_o1_mlx.py (T2I generator + edit/multi-ref)
	│ ├── hidream_model.py (custom heads + forward_generation)
	│ ├── pipeline_helpers.py (T2I sample, mrope, mask, patchify)
	│ └── flow_match.py (FlashFlowMatchScheduler in MLX)
	├── docs/
	│ ├── EVALUATION.md (perf + quality findings, A/B vs mflux)
	│ ├── HIDREAM_O1_MLX_PORT_REPORT.md (architecture + weight conversion details)
	│ └── PHOSPHENE_INTEGRATION_PLAN.md (how it slots into a host app)
	├── sample_outputs/ (gallery)
	└── mlx_models/ (where converted weights land)
	```

	## Performance

	| Resolution | Per step | Total (28 steps) | Peak RAM |
	|---|---|---|---|
	| 1024×1024 | 2.4 s | 67 s | 16 GB |
	| 1440×2560 | 4.5 s | 127 s | 16 GB |
	| 2048×2048 | 6.7 s | 187 s | 16 GB |
	| 3104×1312 | 7.6 s | 213 s | 16 GB |

	`mx.compile` gives no measurable speedup (0%) because the inference loop is memory-bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps.
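For planning batch jobs, the totals in the table are just the per-step time multiplied by the 28 steps. A minimal sketch, using the per-step figures copied from the table (they are rounded, so the product can differ from the listed total by about a second); the function names are hypothetical:

```python
STEPS = 28  # step count of the Dev variant

# Per-step seconds from the table above, keyed by (width, height).
PER_STEP_SECONDS = {
    (1024, 1024): 2.4,
    (1440, 2560): 4.5,
    (2048, 2048): 6.7,
    (3104, 1312): 7.6,
}


def estimate_total_seconds(width, height, steps=STEPS):
    """Estimate wall time for one image at a benchmarked resolution."""
    return PER_STEP_SECONDS[(width, height)] * steps


def estimate_batch_seconds(sizes, steps=STEPS):
    """Estimate wall time for a list of (width, height) jobs run back to back."""
    return sum(estimate_total_seconds(w, h, steps) for w, h in sizes)


for (w, h) in PER_STEP_SECONDS:
    print(f"{w}x{h}: ~{estimate_total_seconds(w, h):.0f} s")
```

These numbers come from a 64 GB Mac Studio, so treat them as a baseline rather than a guarantee on other machines.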

	## Status

	- ✅ Text-to-image: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac
	- ✅ Instruction edit (K=1 ref): working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
	- ✅ Multi-reference subject personalization (K=2-3 refs): supported by the upstream architecture and our port; same `--ref-images` flag with multiple paths
	- ✅ Native MLX — no PyTorch, no CUDA, no flash-attn at inference time
	- ⚠ Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path works at all quants, apart from the grid artifacts Q6/Q8 show at non-square dims.

	## Acknowledgements

	- [HiDream-ai](https://github.com/HiDream-ai) for the original HiDream-O1-Image model + MIT license
	- [Blaizzy/mlx-vlm](https://github.com/Blaizzy/mlx-vlm) for the Qwen3-VL MLX backbone (this port reuses their vision tower + decoder layers + mrope-3D wholesale)
	- [Apple ml-explore/mlx](https://github.com/ml-explore/mlx) for the MLX framework
	- The Civitai community's [HiDream prompt-engineering guide](https://civitai.com/articles/16050/hi-dream-prompt-engineering)

	## Citation

	If you use this in research, cite the upstream model:

	```bibtex
	@misc{hidream-o1-image,
	author = {HiDream-ai},
	title = {HiDream-O1-Image: Pixel-Level Unified Transformer},
	year = {2026},
	url = {https://github.com/HiDream-ai/HiDream-O1-Image}
	}
	```

	## License

	MIT — see [LICENSE](LICENSE).