---
license: mit
base_model: HiDream-ai/HiDream-O1-Image-Dev
tags:
  - mlx
  - mlx-vlm
  - hidream
  - text-to-image
  - apple-silicon
  - bf16
language:
  - en
pipeline_tag: text-to-image
library_name: mlx
inference: false
authors:
  - Mrbizarro
---

# HiDream-O1-Image-Dev — MLX port for Apple Silicon

> Ported by **[Mrbizarro](https://huggingface.co/Mrbizarro)** · MIT licensed · published to mlx-community

## 🎛️ Run it one-click in **[Phosphene](https://github.com/mrbizarro/phosphene)**

Phosphene is a free local generative-video panel for Apple Silicon (Mac, M1+). It ships with HiDream-O1 wired into its Image Studio — pick **"HiDream-O1-Image-Dev BF16"** from the engine dropdown and you have native edit + multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. **[Install Pinokio](https://pinokio.computer)**, then in Pinokio install [Phosphene](https://github.com/mrbizarro/phosphene).

---

A native MLX port of [HiDream-ai/HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) for fast local image generation on Apple Silicon Macs. **No PyTorch, no CUDA, no flash-attn required at inference time.**

**Capabilities** (all native to HiDream-O1, all working in this port):
- **Text-to-image** at 1024×1024 / 2048×2048 / non-square trained dims
- **Instruction-based image edit** with 1 reference image (e.g. *"change the chef's white jacket to red"* — preserves scene, pose, identity)
- **Multi-reference subject personalization** with 2-3 reference images (compose multiple subjects in a new scene)

HiDream-O1 is an 8B Qwen3-VL-based **unified pixel-patch transformer** — it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
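To make the patch arithmetic concrete, here is a minimal sketch of how many 32×32 patches an image of a given size breaks into. It counts image patches only; text and reference-image tokens that share the sequence are not modelled, and the helper names are hypothetical rather than taken from this repo.

```python
PATCH = 32  # patch edge in pixels, per the model description above


def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Return (columns, rows) of patches covering a width x height image."""
    if width % patch or height % patch:
        raise ValueError(f"{width}x{height} is not a multiple of {patch}")
    return width // patch, height // patch


def patch_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Number of image patches, i.e. one sequence position per patch."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows


def values_per_patch(patch: int = PATCH, channels: int = 3) -> int:
    """Raw RGB values carried by a single patch."""
    return patch * patch * channels


for w, h in [(1024, 1024), (1440, 2560), (2048, 2048)]:
    print(f"{w}x{h}: {patch_tokens(w, h)} patches of {values_per_patch()} values")
```

Under these assumptions a 2048×2048 image has four times as many patches as a 1024×1024 one, which is consistent with per-step time rising with resolution.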

This port:
- Reuses [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm)'s Qwen3-VL backbone (vision tower, decoder layers, mrope-3D)
- Adds the three diffusion-side custom heads (`t_embedder1`, `x_embedder`, `final_layer2`)
- Ports the `FlashFlowMatchEulerDiscreteScheduler` and the unified-token-sequence builder
- Ships **BF16 weights** (no quantization — see "Why BF16" below)
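
For readers unfamiliar with flow-matching samplers, the following is a generic Euler update of the kind such schedulers perform, not the ported `FlashFlowMatchEulerDiscreteScheduler` itself. The linear sigma schedule, the velocity sign convention, and every function name here are assumptions made for illustration; the real schedule lives in `flow_match.py`.

```python
def euler_flow_step(x, velocity, sigma, sigma_next):
    """One Euler step: move x along the predicted velocity by (sigma_next - sigma)."""
    return [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, velocity)]


def linear_sigmas(num_steps):
    """Assumed schedule: evenly spaced sigmas from 1.0 (pure noise) down to 0.0."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def sample(noise, velocity_fn, num_steps=28):
    """Integrate from noise to a sample by calling velocity_fn(x, sigma) each step."""
    x = list(noise)
    sigmas = linear_sigmas(num_steps)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x = euler_flow_step(x, velocity_fn(x, sigma), sigma, sigma_next)
    return x


# Toy check: for a straight noise-to-data path the true velocity is (noise - data),
# so Euler integration should land on the data point.
target = [0.25, -0.5, 1.0]
noise = [1.5, 0.3, -0.7]


def toy_velocity(x, sigma):
    return [n - t for n, t in zip(noise, target)]


result = sample(noise, toy_velocity)
print(result)
```

In the real pipeline the velocity would come from a forward pass of the model at each step, which is why step count maps almost directly onto wall-clock time.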

## Hero samples

All generated by the included generator script on a 64 GB Mac Studio. Click any image to open it at full resolution.

<table>
<tr>
<td><a href="sample_outputs/hero/04_construction_worker.png"><img src="sample_outputs/hero/04_construction_worker.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/01_tea_master.png"><img src="sample_outputs/hero/01_tea_master.png" width="350"/></a></td>
</tr>
<tr>
<td>Construction worker on a rainy rooftop, Kodak Tri-X B&amp;W. 2048×2048, BF16, 213s.</td>
<td>Elderly Japanese tea master holding a ceramic cup. 1024×1024, Q6 (showcase), 36s.</td>
</tr>

<tr>
<td><a href="sample_outputs/hero/02_tropical_beach.png"><img src="sample_outputs/hero/02_tropical_beach.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/07_kitchen_morning.png"><img src="sample_outputs/hero/07_kitchen_morning.png" width="350"/></a></td>
</tr>
<tr>
<td>Tropical beach with turquoise water and palms. 1024×1024, Q8, 67s.</td>
<td>Candid morning portrait, woman with coffee + toast, soft window light. 1440×2560, BF16, 127s.</td>
</tr>

<tr>
<td><a href="sample_outputs/hero/03_astronaut.png"><img src="sample_outputs/hero/03_astronaut.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/05_mountain_peak.png"><img src="sample_outputs/hero/05_mountain_peak.png" width="350"/></a></td>
</tr>
<tr>
<td>Astronaut in space-station corridor, anamorphic lens flare. 2560×1440, BF16, 187s.</td>
<td>Snow-capped mountain peak at sunset. 2048×2048, Q4 (early), 236s.</td>
</tr>

<tr>
<td><a href="sample_outputs/hero/06_alice_cyberpunk.png"><img src="sample_outputs/hero/06_alice_cyberpunk.png" width="350"/></a></td>
<td><a href="sample_outputs/hero/08_fitness_BF16.png"><img src="sample_outputs/hero/08_fitness_BF16.png" width="350"/></a></td>
</tr>
<tr>
<td>Alice in cyberpunk, neon Cheshire cat hologram. 2048×2048, Q8, 276s.</td>
<td>Fitness influencer mid-deadlift in industrial gym. 1440×2560, BF16, 127s.</td>
</tr>
</table>

More: [`sample_outputs/hero/`](sample_outputs/hero/).

## Variants

| Variant | Repo | Backbone size | RAM (1024) | Quality |
|---|---|---|---|---|
| **BF16** (this repo) | `mlx-community/HiDream-O1-Image-Dev-mlx-bf16` | 17.5 GB | 16 GB | ✅ Clean across all trained dims |
| Q8 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q8`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q8) | 10 GB | 11.5 GB | ⚠ Clean at square dims, grid at non-square |
| Q6 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q6`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q6) | 8 GB | 8.5 GB | ⚠ Clean at square dims, grid at non-square |

**Q4 was tested and rejected**: brightness collapses and every image comes out dark.

### Why BF16 is the safe default

Per-group dequantization rounding in the Q6 and Q8 variants compounds across the 36 decoder layers and shows up as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at **non-square trained dimensions** such as 1440×2560 or 3104×1312. BF16 matches upstream's `torch_dtype=torch.float32 + autocast(bfloat16)` precision and is the only variant that stays clean across all trained dimensions.

If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, **Q6 is under half the size (8 GB vs 17.5 GB) and roughly 2× faster**, with no quality loss at those dims. Use Q6 on a 16 GB Mac, BF16 on 32 GB+.
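
The rounding effect described in this section can be illustrated with a generic group-wise affine quantizer. This is a sketch under stated assumptions: the group size of 64, the Gaussian toy weights, and the function names are all illustrative, and it does not claim to match the exact scheme the conversion script uses. It measures one round trip on one weight vector only, so it shows why fewer bits mean larger per-weight error but does not reproduce the compounding across layers or the grid artifact.

```python
import random


def quantize_group(values, bits):
    """Affine-quantize one group to `bits` bits and return the dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def roundtrip_error(weights, bits, group_size=64):
    """Mean absolute error after quantizing and dequantizing group by group."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored = quantize_group(group, bits)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(weights)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
errors = {bits: roundtrip_error(weights, bits) for bits in (4, 6, 8)}

for bits, err in errors.items():
    print(f"{bits}-bit mean abs round-trip error: {err:.2e}")
```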

## Install

Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio.

### Quick start (download pre-converted weights — recommended)

```bash
# Download the repo (code, docs, samples, weights)
hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx
cd hidream-o1-mlx

# Set up the venv
uv venv --python 3.11
uv pip install -r requirements.txt

# Generate (model files are at the repo root — pass --model-path .)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path . \
  --prompt "your prompt here" \
  --output out.png
```

### Or convert from upstream weights yourself

```bash
git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16
cd HiDream-O1-Image-Dev-mlx-bf16
uv venv --python 3.11
uv pip install -r requirements.txt

# Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk)
.venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \
  --hf-source HiDream-ai/HiDream-O1-Image-Dev \
  --out-dir mlx_models/hidream-o1-dev-bf16 \
  --bits 16
```

## Usage

```bash
# Single image, default 1024×1024 BF16
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "your prompt here" \
  --output sample_outputs/whatever.png \
  --seed 42

# Higher resolution (2048×2048 = upstream default)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 2048 --height 2048 \
  --output sample_outputs/big.png

# Vertical / cinema (auto-snaps to nearest trained ratio)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "..." \
  --width 1440 --height 2560 \
  --output sample_outputs/portrait.png

# Instruction-based edit (one ref image)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
  --output sample_outputs/edit_red_jacket.png \
  --ref-images /path/to/chef.jpg \
  --seed 42

# Multi-reference subject personalization (2-3 refs)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
  --model-path mlx_models/hidream-o1-dev-bf16 \
  --prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
  --output sample_outputs/multi_ref.png \
  --ref-images /path/to/person.jpg /path/to/place.jpg \
  --seed 42
```
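
For batch runs, the same invocations can be assembled from Python. The sketch below only uses the flags shown in the commands above (`--model-path`, `--prompt`, `--output`, `--width`, `--height`, `--ref-images`, `--seed`); the wrapper functions themselves are hypothetical and are not shipped in this repo. It defaults to a dry run so nothing is executed until you opt in.

```python
import subprocess
from pathlib import Path

SCRIPT = "scripts/hidream_o1/generate_hidream_o1_mlx.py"


def build_command(prompt, output, model_path="mlx_models/hidream-o1-dev-bf16",
                  seed=None, width=None, height=None, ref_images=(),
                  python=".venv/bin/python"):
    """Assemble the argv list for one generator call (hypothetical helper)."""
    cmd = [python, SCRIPT, "--model-path", str(model_path),
           "--prompt", prompt, "--output", str(output)]
    if width is not None and height is not None:
        cmd += ["--width", str(width), "--height", str(height)]
    if ref_images:
        cmd += ["--ref-images", *[str(p) for p in ref_images]]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def run_seeds(prompt, seeds, out_dir="sample_outputs", dry_run=True, **kwargs):
    """Build one command per seed; execute them only when dry_run is False."""
    commands = []
    for seed in seeds:
        output = Path(out_dir) / f"seed_{seed}.png"
        cmd = build_command(prompt, output, seed=seed, **kwargs)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in run_seeds("your prompt here", seeds=[1, 2, 3]):
        print(" ".join(cmd))
```

Passing arguments as a list avoids shell-quoting problems with long prompts that contain apostrophes or commas.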

### Trained resolutions

HiDream-O1 was trained on a fixed list of resolutions, and the generator auto-snaps a requested size to the closest match. Off-spec dimensions produce visible patch artifacts. The trained list:

```
2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560,
2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304
```
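
One plausible way to pick the closest entry from the list above is to compare aspect ratios on a log scale. This is a sketch only: the generator's actual snapping rule is not documented here, and it may also preserve the requested pixel count (for example when 1024×1024 is requested). The function name is hypothetical.

```python
import math

# The trained (width, height) pairs listed above.
TRAINED = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104), (2304, 1792),
    (1792, 2304),
]


def nearest_trained_ratio(width: int, height: int) -> tuple[int, int]:
    """Return the trained size whose aspect ratio is closest to width/height.

    Distances are taken between log aspect ratios so that landscape and
    portrait deviations are treated symmetrically.
    """
    target = math.log(width / height)
    return min(TRAINED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


for request in [(1920, 1080), (1200, 1600), (1000, 1000)]:
    print(request, "->", nearest_trained_ratio(*request))
```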

## Prompt tips for realism

HiDream is responsive to camera/film terminology. To avoid the AI-glossy look:

- Lead with `masterpiece, best quality` (a phrase the community found the model responds to)
- Order the prompt as Subject + Actions → Setting → Style → Details
- Specify equipment: `Leica M6 with Kodak Tri-X 400`, `Pentax K1000 + Cinestill 800T`, `Hasselblad H6D medium format`
- Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen
- Spell out skin imperfection: "natural pores", "faint laugh lines", "weathered hands", "no retouching"
- Avoid "stunning", "perfect", "beautiful" — they push toward AI-glamour aesthetics

The Dev model uses `guidance_scale=0.0` so negative prompts have no effect — push positive prompts harder instead.
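
The ordering and wording tips above can be folded into a small prompt builder. This is an illustrative sketch, not part of the repo: the function names and the comma-joined layout are assumptions, and the word check only covers the three terms named in the tips.

```python
import re

GLAMOUR_WORDS = ("stunning", "perfect", "beautiful")  # terms the tips say to avoid


def build_prompt(subject, setting, style, details=(),
                 lead="masterpiece, best quality"):
    """Join parts in the suggested order: lead, subject, setting, style, details."""
    parts = [lead, subject, setting, style, *details]
    return ", ".join(p.strip() for p in parts if p and p.strip())


def glamour_words_in(prompt):
    """Return the discouraged words that appear as whole words in the prompt."""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return [w for w in GLAMOUR_WORDS if w in words]


prompt = build_prompt(
    subject="elderly fisherman mending a net",
    setting="harbour at dawn",
    style="Leica M6 with Kodak Tri-X 400",
    details=["natural pores", "weathered hands", "no retouching"],
)
print(prompt)
print("discouraged words:", glamour_words_in(prompt))
```

Matching whole words keeps a phrase like "skin imperfection" from being flagged because it happens to contain "perfect".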

## What's in this repo

```
hidream-o1-mlx/
├── README.md                                 (this file)
├── LICENSE                                   (MIT)
├── requirements.txt                          (mlx-vlm 0.5.0, transformers 5.8+, deps)
├── scripts/hidream_o1/
│   ├── convert_hidream_o1_to_mlx.py          (HF → MLX, BF16 / Q4 / Q6 / Q8)
│   ├── generate_hidream_o1_mlx.py            (T2I generator + experimental edit/multi-ref)
│   ├── hidream_model.py                      (custom heads + forward_generation)
│   ├── pipeline_helpers.py                   (T2I sample, mrope, mask, patchify)
│   └── flow_match.py                         (FlashFlowMatchScheduler in MLX)
├── docs/
│   ├── EVALUATION.md                         (perf + quality findings, A/B vs mflux)
│   ├── HIDREAM_O1_MLX_PORT_REPORT.md         (architecture + weight conversion details)
│   └── PHOSPHENE_INTEGRATION_PLAN.md         (how it slots into a host app)
├── sample_outputs/                           (gallery)
└── mlx_models/                               (where converted weights land)
```

## Performance

| Resolution | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|
| 1024×1024 | 2.4 s | 67 s | 16 GB |
| 1440×2560 | 4.5 s | 127 s | 16 GB |
| 2048×2048 | 6.7 s | 187 s | 16 GB |
| 3104×1312 | 7.6 s | 213 s | 16 GB |

`mx.compile` gives no measurable speedup: the inference loop is bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps.
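
For planning batch jobs, the per-step figures in the table translate into rough wall-clock estimates. The sketch below uses only the numbers measured above on a 64 GB Mac Studio; other machines will differ, fixed overhead such as model loading is not included, and the helper names are hypothetical.

```python
# Seconds per denoising step, copied from the table above (BF16, 64 GB Mac Studio).
PER_STEP_SECONDS = {
    (1024, 1024): 2.4,
    (1440, 2560): 4.5,
    (2048, 2048): 6.7,
    (3104, 1312): 7.6,
}


def estimate_total_seconds(width: int, height: int, steps: int = 28) -> float:
    """Approximate denoising time for one image at a measured resolution."""
    return PER_STEP_SECONDS[(width, height)] * steps


def estimate_batch_minutes(jobs, steps: int = 28) -> float:
    """Approximate total minutes for a list of (width, height) jobs."""
    return sum(estimate_total_seconds(w, h, steps) for w, h in jobs) / 60


jobs = [(1024, 1024)] * 10 + [(2048, 2048)] * 2
print(f"{estimate_batch_minutes(jobs):.1f} minutes for {len(jobs)} images")
```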

## Status

- ✅ **Text-to-image**: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac
- ✅ **Instruction edit (K=1 ref)**: working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
- ✅ **Multi-reference subject personalization (K=2-3 refs)**: supported by the upstream architecture and our port; same `--ref-images` flag with multiple paths
- ✅ Native MLX — no PyTorch, no CUDA, no flash-attn at inference time
- ⚠ Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path is fine at all quants.

## Acknowledgements

- [HiDream-ai](https://github.com/HiDream-ai) for the original HiDream-O1-Image model + MIT license
- [Blaizzy/mlx-vlm](https://github.com/Blaizzy/mlx-vlm) for the Qwen3-VL MLX backbone (this port reuses their vision tower + decoder layers + mrope-3D wholesale)
- [Apple ml-explore/mlx](https://github.com/ml-explore/mlx) for the MLX framework
- The Civitai community's [HiDream prompt-engineering guide](https://civitai.com/articles/16050/hi-dream-prompt-engineering)

## Citation

If you use this in research, cite the upstream model:

```bibtex
@misc{hidream-o1-image,
  author = {HiDream-ai},
  title = {HiDream-O1-Image: Pixel-Level Unified Transformer},
  year = {2026},
  url = {https://github.com/HiDream-ai/HiDream-O1-Image}
}
```

## License

MIT — see [LICENSE](LICENSE).