---
license: mit
base_model: HiDream-ai/HiDream-O1-Image-Dev
tags:
  - mlx
  - mlx-vlm
  - hidream
  - text-to-image
  - apple-silicon
  - bf16
language:
  - en
pipeline_tag: text-to-image
library_name: mlx
inference: false
authors:
  - Mrbizarro
---

# HiDream-O1-Image-Dev: MLX port for Apple Silicon

> Ported by **[Mrbizarro](https://huggingface.co/Mrbizarro)** · MIT licensed · published to mlx-community

## 🎛️ Run it one-click in **[Phosphene](https://github.com/mrbizarro/phosphene)**

Phosphene is a free local generative-video panel for Apple Silicon (Mac, M1+). It ships with HiDream-O1 wired into its Image Studio: pick **"HiDream-O1-Image-Dev BF16"** from the engine dropdown and you have native edit + multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. **[Install Pinokio](https://pinokio.computer)**, then in Pinokio install [Phosphene](https://github.com/mrbizarro/phosphene).

---

A native MLX port of [HiDream-ai/HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) for fast local image generation on Apple Silicon Macs. **No PyTorch, no CUDA, no flash-attn required at inference time.**

**Capabilities** (all native to HiDream-O1, all working in this port):
- **Text-to-image** at 1024×1024 / 2048×2048 / non-square trained dims
- **Instruction-based image edit** with 1 reference image (e.g. *"change the chef's white jacket to red"*, which preserves scene, pose, and identity)
- **Multi-reference subject personalization** with 2-3 reference images (compose multiple subjects in a new scene)

HiDream-O1 is an 8B Qwen3-VL-based **unified pixel-patch transformer**: it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
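To give a feel for what "raw 32×32 RGB patches" means in sequence-length terms, here is a minimal arithmetic sketch. It assumes the patches tile the image on a non-overlapping grid; that is a simplification for illustration, and `patch_grid` is a hypothetical helper, not a function from this repo.

```python
PATCH = 32      # patch edge in pixels, from the description above
CHANNELS = 3    # RGB

def patch_grid(width: int, height: int) -> tuple[int, int, int]:
    """Return (columns, rows, total patches) for a width x height image.

    Assumes a non-overlapping grid of PATCH x PATCH tiles (illustrative only).
    """
    if width % PATCH or height % PATCH:
        raise ValueError(f"{width}x{height} is not a multiple of {PATCH}")
    cols, rows = width // PATCH, height // PATCH
    return cols, rows, cols * rows

for w, h in [(1024, 1024), (2048, 2048), (1440, 2560)]:
    cols, rows, total = patch_grid(w, h)
    print(f"{w}x{h}: {cols} x {rows} = {total} patches, "
          f"{PATCH * PATCH * CHANNELS} values per patch")
```

Under that assumption, doubling both dimensions quadruples the number of image patches the backbone has to process at every denoising step.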

This port:
- Reuses [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm)'s Qwen3-VL backbone (vision tower, decoder layers, mrope-3D)
- Adds the three diffusion-side custom heads (`t_embedder1`, `x_embedder`, `final_layer2`)
- Ports the `FlashFlowMatchEulerDiscreteScheduler` and the unified-token-sequence builder
- Ships **BF16 weights** (no quantization; see "Why BF16 is the safe default" below)

## Hero samples

All generated by the included generator script on a 64 GB Mac Studio. Click any image to open full-resolution.

<table>
<tr>
<td><a href="sample_outputs/hero/04_construction_worker.png"><img src="sample_outputs/hero/04_construction_worker.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/01_tea_master.png"><img src="sample_outputs/hero/01_tea_master.png" width="350"/></a></td>
</tr>
<tr>
<td>Construction worker on a rainy rooftop, Kodak Tri-X B&amp;W. 2048×2048, BF16, 213 s.</td>
<td>Elderly Japanese tea master holding a ceramic cup. 1024×1024, Q6 (showcase), 36 s.</td>
</tr>

<tr>
<td><a href="sample_outputs/hero/02_tropical_beach.png"><img src="sample_outputs/hero/02_tropical_beach.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/07_kitchen_morning.png"><img src="sample_outputs/hero/07_kitchen_morning.png" width="350"/></a></td>
</tr>
<tr>
<td>Tropical beach with turquoise water and palms. 1024×1024, Q8, 67 s.</td>
<td>Candid morning portrait, woman with coffee + toast, soft window light. 1440×2560, BF16, 127 s.</td>
</tr>

<tr>
<td><a href="sample_outputs/hero/03_astronaut.png"><img src="sample_outputs/hero/03_astronaut.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/05_mountain_peak.png"><img src="sample_outputs/hero/05_mountain_peak.png" width="350"/></a></td>
</tr>
<tr>
<td>Astronaut in space-station corridor, anamorphic lens flare. 2560×1440, BF16, 187 s.</td>
<td>Snow-capped mountain peak at sunset. 2048×2048, Q4 (early), 236 s.</td>
</tr>

<tr>
<td><a href="sample_outputs/hero/06_alice_cyberpunk.png"><img src="sample_outputs/hero/06_alice_cyberpunk.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/08_fitness_BF16.png"><img src="sample_outputs/hero/08_fitness_BF16.png" width="350"/></a></td>
</tr>
<tr>
<td>Alice in cyberpunk, neon Cheshire cat hologram. 2048×2048, Q8, 276 s.</td>
<td>Fitness influencer mid-deadlift in industrial gym. 1440×2560, BF16, 127 s.</td>
</tr>
</table>

More: [`sample_outputs/hero/`](sample_outputs/hero/).

## Variants

| Variant | Repo | Backbone size | RAM (1024) | Quality |
|---|---|---|---|---|
| **BF16** (this repo) | `mlx-community/HiDream-O1-Image-Dev-mlx-bf16` | 17.5 GB | 16 GB | ✅ Clean across all trained dims |
| Q8 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q8`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q8) | 10 GB | 11.5 GB | ⚠ Clean at square dims, grid at non-square |
| Q6 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q6`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q6) | 8 GB | 8.5 GB | ⚠ Clean at square dims, grid at non-square |

**Q4 was tested and rejected**: brightness collapses and every image comes out dark.

### Why BF16 is the safe default

Per-group dequantization rounding (Q6/Q8) compounds across the 36 decoder layers and shows as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at **non-square trained dimensions** like 1440×2560 or 3104×1312. BF16 matches the upstream's `torch_dtype=torch.float32 + autocast(bfloat16)` precision and is the only variant that stays clean across all trained dimensions.

If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, **Q6 is half the size and 2× faster**, with no quality loss at those dims. Use Q6 on a 16 GB Mac, BF16 on 32 GB+.
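The guidance in this section can be condensed into a small decision helper. This is only a sketch: `pick_variant` is a hypothetical function (nothing in this repo provides it), the 32 GB cut-off is read straight from the sentence above, and the reference-image rule comes from the Status section, which notes that edit requires BF16.

```python
def pick_variant(ram_gb: float, width: int, height: int,
                 uses_ref_images: bool = False) -> str:
    """Suggest a weight variant ("bf16" or "q6") from this README's guidance.

    Hypothetical helper: it encodes the README's rules of thumb, nothing more.
    """
    if uses_ref_images:
        # Edit / multi-reference is reported as degenerate at Q6/Q8.
        return "bf16"
    if width != height:
        # Quantized variants show a grid at non-square trained dims.
        return "bf16"
    # Square-only workflows: Q6 on small machines, BF16 on 32 GB and up.
    return "bf16" if ram_gb >= 32 else "q6"

print(pick_variant(16, 1024, 1024))                        # square, 16 GB Mac
print(pick_variant(16, 1440, 2560))                        # non-square
print(pick_variant(64, 2048, 2048, uses_ref_images=True))  # edit workflow
```

The helper deliberately ignores whether BF16 actually fits in memory on a given machine; it only reflects the quality trade-offs described here.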

## Install

Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio.

### Quick start (download pre-converted weights, recommended)

```bash
# Download the repo (code, docs, samples, weights)
hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx
cd hidream-o1-mlx

# Set up the venv
uv venv --python 3.11
uv pip install -r requirements.txt

# Generate (model files are at the repo root, so pass --model-path .)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path . \
  --prompt "your prompt here" \
  --output out.png
```

### Or convert from upstream weights yourself

```bash
git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16
cd HiDream-O1-Image-Dev-mlx-bf16
uv venv --python 3.11
uv pip install -r requirements.txt

# Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk)
.venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \
  --hf-source HiDream-ai/HiDream-O1-Image-Dev \
  --out-dir mlx_models/hidream-o1-dev-bf16 \
  --bits 16
```

## Usage

```bash
# Single image, default 1024×1024 BF16
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "your prompt here" \
  --output sample_outputs/whatever.png \
  --seed 42

# Higher resolution (2048×2048 = upstream default)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 2048 --height 2048 \
  --output sample_outputs/big.png

# Vertical / cinema (auto-snaps to nearest trained ratio)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 1440 --height 2560 \
  --output sample_outputs/portrait.png

# Instruction-based edit (one ref image)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
  --output sample_outputs/edit_red_jacket.png \
  --ref-images /path/to/chef.jpg \
  --seed 42

# Multi-reference subject personalization (2-3 refs)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
  --output sample_outputs/multi_ref.png \
  --ref-images /path/to/person.jpg /path/to/place.jpg \
  --seed 42
```
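For seed sweeps it can be convenient to drive the generator from Python rather than retyping the command. The sketch below uses only the flags shown above (`--model-path`, `--prompt`, `--output`, `--seed`, `--width`, `--height`); the helper names `build_command` and `generate_seeds` and the `seed_<n>.png` naming scheme are assumptions made for illustration.

```python
import shlex
import subprocess
from pathlib import Path

GENERATOR = "scripts/hidream_o1/generate_hidream_o1_mlx.py"

def build_command(prompt, output, seed,
                  model_path="mlx_models/hidream-o1-dev-bf16",
                  width=None, height=None, python=".venv/bin/python"):
    """Build the argv list for one generator run (hypothetical helper)."""
    cmd = [python, GENERATOR,
           "--model-path", model_path,
           "--prompt", prompt,
           "--output", str(output),
           "--seed", str(seed)]
    if width is not None and height is not None:
        cmd += ["--width", str(width), "--height", str(height)]
    return cmd

def generate_seeds(prompt, seeds, out_dir="sample_outputs", **kwargs):
    """Run the generator once per seed; raises if any run fails."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for seed in seeds:
        target = out / f"seed_{seed}.png"
        subprocess.run(build_command(prompt, target, seed, **kwargs), check=True)

# Dry run: print the command instead of executing it.
print(shlex.join(build_command("your prompt here", "out.png", seed=42)))
```

Call `generate_seeds("your prompt here", seeds=[1, 2, 3])` from the repo root so the relative script and venv paths resolve.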

### Trained resolutions

HiDream-O1 was trained on a fixed list of resolutions. The generator auto-snaps to the closest. Off-spec dims produce visible patch artifacts. The trained list:

```
2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560,
2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304
```
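One plausible way to snap a requested size to the list above is to pick the entry with the closest aspect ratio. The sketch below does that by comparing log-ratios; it is an assumption about how such snapping could work, not a copy of the generator's actual rule, and `snap_to_trained` is a hypothetical name. It also only covers the list as printed, so the 1024×1024 default is outside its scope.

```python
import math

# Trained resolutions as listed above (width, height).
TRAINED = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104),
    (2304, 1792), (1792, 2304),
]

def snap_to_trained(width: int, height: int) -> tuple[int, int]:
    """Return the trained resolution whose aspect ratio is closest.

    Distances are measured on log(width / height) so that landscape and
    portrait ratios are treated symmetrically.
    """
    target = math.log(width / height)
    return min(TRAINED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

print(snap_to_trained(1920, 1080))
print(snap_to_trained(1080, 1920))
```

A 16:9 request such as 1920×1080 lands on 2560×1440 under this rule because the two share the same aspect ratio.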

## Prompt tips for realism

HiDream is responsive to camera/film terminology. To avoid the AI-glossy look:

- Lead with `masterpiece, best quality` (community-found responder phrase)
- Subject + Actions → Setting → Style → Details ordering
- Specify equipment: `Leica M6 with Kodak Tri-X 400`, `Pentax K1000 + Cinestill 800T`, `Hasselblad H6D medium format`
- Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen
- Spell out skin imperfection: "natural pores", "faint laugh lines", "weathered hands", "no retouching"
- Avoid "stunning", "perfect", "beautiful": they push toward AI-glamour aesthetics

The Dev model uses `guidance_scale=0.0`, so negative prompts have no effect; push positive prompts harder instead.

## What's in this repo

```
hidream-o1-mlx/
├── README.md                                 (this file)
├── LICENSE                                   (MIT)
├── requirements.txt                          (mlx-vlm 0.5.0, transformers 5.8+, deps)
├── scripts/hidream_o1/
│   ├── convert_hidream_o1_to_mlx.py          (HF → MLX, BF16 / Q4 / Q6 / Q8)
│   ├── generate_hidream_o1_mlx.py            (T2I generator + experimental edit/multi-ref)
│   ├── hidream_model.py                      (custom heads + forward_generation)
│   ├── pipeline_helpers.py                   (T2I sample, mrope, mask, patchify)
│   └── flow_match.py                         (FlashFlowMatchScheduler in MLX)
├── docs/
│   ├── EVALUATION.md                         (perf + quality findings, A/B vs mflux)
│   ├── HIDREAM_O1_MLX_PORT_REPORT.md         (architecture + weight conversion details)
│   └── PHOSPHENE_INTEGRATION_PLAN.md         (how it slots into a host app)
├── sample_outputs/                           (gallery)
└── mlx_models/                               (where converted weights land)
```

## Performance

| Resolution | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|
| 1024×1024 | 2.4 s | 67 s | 16 GB |
| 1440×2560 | 4.5 s | 127 s | 16 GB |
| 2048×2048 | 6.7 s | 187 s | 16 GB |
| 3104×1312 | 7.6 s | 213 s | 16 GB |

`mx.compile` gives 0% speedup: the inference loop is bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps.
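The totals in the table are essentially the per-step time multiplied by the 28 denoising steps. A quick check, using only the numbers from the table (the per-step values are rounded, so the products land within about a second of the reported totals):

```python
STEPS = 28  # Dev variant step count

# Per-step and reported total times (seconds), copied from the table above.
PER_STEP_S = {"1024x1024": 2.4, "1440x2560": 4.5,
              "2048x2048": 6.7, "3104x1312": 7.6}
REPORTED_TOTAL_S = {"1024x1024": 67, "1440x2560": 127,
                    "2048x2048": 187, "3104x1312": 213}

def estimated_total(per_step_s: float, steps: int = STEPS) -> float:
    """Estimate wall-clock time for a run as per-step time x step count."""
    return per_step_s * steps

for res, per_step in PER_STEP_S.items():
    est = estimated_total(per_step)
    print(f"{res}: {per_step} s x {STEPS} = {est:.1f} s "
          f"(table: {REPORTED_TOTAL_S[res]} s)")
```

The same multiplication gives a rough budget for other step counts, assuming the per-step cost stays constant across the run.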

## Status

- ✅ **Text-to-image**: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac
- ✅ **Instruction edit (K=1 ref)**: working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
- ✅ **Multi-reference subject personalization (K=2-3 refs)**: supported by the upstream architecture and our port; same `--ref-images` flag with multiple paths
- ✅ Native MLX: no PyTorch, no CUDA, no flash-attn at inference time
- ⚠ Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path is fine at all quants.

## Acknowledgements

- [HiDream-ai](https://github.com/HiDream-ai) for the original HiDream-O1-Image model + MIT license
- [Blaizzy/mlx-vlm](https://github.com/Blaizzy/mlx-vlm) for the Qwen3-VL MLX backbone (this port reuses their vision tower + decoder layers + mrope-3D wholesale)
- [Apple ml-explore/mlx](https://github.com/ml-explore/mlx) for the MLX framework
- The Civitai community's [HiDream prompt-engineering guide](https://civitai.com/articles/16050/hi-dream-prompt-engineering)

## Citation

If you use this in research, cite the upstream model:

```bibtex
@misc{hidream-o1-image,
  author = {HiDream-ai},
  title = {HiDream-O1-Image: Pixel-Level Unified Transformer},
  year = {2026},
  url = {https://github.com/HiDream-ai/HiDream-O1-Image}
}
```

## License

MIT. See [LICENSE](LICENSE).