| license: mit | |
| base_model: HiDream-ai/HiDream-O1-Image-Dev | |
| tags: | |
| - mlx | |
| - mlx-vlm | |
| - hidream | |
| - text-to-image | |
| - apple-silicon | |
| - bf16 | |
| language: | |
| - en | |
| pipeline_tag: text-to-image | |
| library_name: mlx | |
| inference: false | |
| authors: | |
| - Mrbizarro | |
| # HiDream-O1-Image-Dev: MLX port for Apple Silicon | |
| > Ported by **[Mrbizarro](https://huggingface.co/Mrbizarro)** · MIT licensed · published to mlx-community | |
| ## 🎞️ Run it one-click in **[Phosphene](https://github.com/mrbizarro/phosphene)** | |
| Phosphene is a free local generative-video panel for Apple Silicon (Mac, M1+). It ships with HiDream-O1 wired into its Image Studio: pick **"HiDream-O1-Image-Dev BF16"** from the engine dropdown and you have native edit + multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. **[Install Pinokio](https://pinokio.computer)**, then in Pinokio install [Phosphene](https://github.com/mrbizarro/phosphene). | |
| --- | |
| A native MLX port of [HiDream-ai/HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) for fast local image generation on Apple Silicon Macs. **No PyTorch, no CUDA, no flash-attn required at inference time.** | |
| **Capabilities** (all native to HiDream-O1, all working in this port): | |
| - **Text-to-image** at 1024×1024 / 2048×2048 / non-square trained dims | |
| - **Instruction-based image edit** with 1 reference image (e.g. *"change the chef's white jacket to red"*, which preserves scene, pose, and identity) | |
| - **Multi-reference subject personalization** with 2-3 reference images (compose multiple subjects in a new scene) | |
| HiDream-O1 is an 8B Qwen3-VL-based **unified pixel-patch transformer**: it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license. | |
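To make the patch arithmetic concrete, here is a small sketch (not code from this repo) that counts how many 32×32 RGB patches an image of a given size decomposes into. The helper name `patch_grid` is hypothetical, and the sketch assumes both dimensions are exact multiples of 32, which holds for every trained resolution listed further down.

```python
PATCH = 32      # patch edge in pixels, per the model description above
CHANNELS = 3    # RGB


def patch_grid(width: int, height: int) -> tuple[int, int, int]:
    """Return (columns, rows, values_per_patch) for a width x height image."""
    if width % PATCH or height % PATCH:
        raise ValueError("width and height must be multiples of 32")
    return width // PATCH, height // PATCH, PATCH * PATCH * CHANNELS


cols, rows, values = patch_grid(1024, 1024)
print(f"{cols * rows} patches, {values} values each")
```

A 1024×1024 image is a 32×32 grid of patches, and each patch carries 32 × 32 × 3 = 3072 raw pixel values.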
| This port: | |
| - Reuses [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm)'s Qwen3-VL backbone (vision tower, decoder layers, mrope-3D) | |
| - Adds the three diffusion-side custom heads (`t_embedder1`, `x_embedder`, `final_layer2`) | |
| - Ports the `FlashFlowMatchEulerDiscreteScheduler` and the unified-token-sequence builder | |
| - Ships **BF16 weights** (no quantization; see "Why BF16 is the safe default" below) | |
| ## Hero samples | |
| All generated by the included generator script on a 64 GB Mac Studio. Click any image to open full-resolution. | |
| <table> | |
| <tr> | |
| <td><a href="sample_outputs/hero/04_construction_worker.png"><img src="sample_outputs/hero/04_construction_worker.png" width="350"/></a></td> | |
| <td><a href="sample_outputs/hero/01_tea_master.png"><img src="sample_outputs/hero/01_tea_master.png" width="350"/></a></td> | |
| </tr> | |
| <tr> | |
| <td>Construction worker on a rainy rooftop, Kodak Tri-X B&W. 2048×2048, BF16, 213s.</td> | |
| <td>Elderly Japanese tea master holding a ceramic cup. 1024×1024, Q6 (showcase), 36s.</td> | |
| </tr> | |
| <tr> | |
| <td><a href="sample_outputs/hero/02_tropical_beach.png"><img src="sample_outputs/hero/02_tropical_beach.png" width="350"/></a></td> | |
| <td><a href="sample_outputs/hero/07_kitchen_morning.png"><img src="sample_outputs/hero/07_kitchen_morning.png" width="350"/></a></td> | |
| </tr> | |
| <tr> | |
| <td>Tropical beach with turquoise water and palms. 1024×1024, Q8, 67s.</td> | |
| <td>Candid morning portrait, woman with coffee + toast, soft window light. 1440×2560, BF16, 127s.</td> | |
| </tr> | |
| <tr> | |
| <td><a href="sample_outputs/hero/03_astronaut.png"><img src="sample_outputs/hero/03_astronaut.png" width="350"/></a></td> | |
| <td><a href="sample_outputs/hero/05_mountain_peak.png"><img src="sample_outputs/hero/05_mountain_peak.png" width="350"/></a></td> | |
| </tr> | |
| <tr> | |
| <td>Astronaut in space-station corridor, anamorphic lens flare. 2560×1440, BF16, 187s.</td> | |
| <td>Snow-capped mountain peak at sunset. 2048×2048, Q4 (early), 236s.</td> | |
| </tr> | |
| <tr> | |
| <td><a href="sample_outputs/hero/06_alice_cyberpunk.png"><img src="sample_outputs/hero/06_alice_cyberpunk.png" width="350"/></a></td> | |
| <td><a href="sample_outputs/hero/08_fitness_BF16.png"><img src="sample_outputs/hero/08_fitness_BF16.png" width="350"/></a></td> | |
| </tr> | |
| <tr> | |
| <td>Alice in cyberpunk, neon Cheshire cat hologram. 2048×2048, Q8, 276s.</td> | |
| <td>Fitness influencer mid-deadlift in industrial gym. 1440×2560, BF16, 127s.</td> | |
| </tr> | |
| </table> | |
| More: [`sample_outputs/hero/`](sample_outputs/hero/). | |
| ## Variants | |
| | Variant | Repo | Backbone size | RAM (1024) | Quality | | |
| |---|---|---|---|---| | |
| | **BF16** (this repo) | `mlx-community/HiDream-O1-Image-Dev-mlx-bf16` | 17.5 GB | 16 GB | Clean across all trained dims | |
| | Q8 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q8`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q8) | 10 GB | 11.5 GB | Clean at square dims, grid at non-square | |
| | Q6 | [`mlx-community/HiDream-O1-Image-Dev-mlx-q6`](https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-q6) | 8 GB | 8.5 GB | Clean at square dims, grid at non-square | |
| **Q4 was tested and rejected**: brightness collapses and every image comes out dark. | |
| ### Why BF16 is the safe default | |
| Per-group dequantization rounding (Q6/Q8) compounds across the 36 decoder layers and shows up as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at **non-square trained dimensions** like 1440×2560 or 3104×1312. BF16 matches the upstream's `torch_dtype=torch.float32 + autocast(bfloat16)` precision and is the only variant that is clean across all trained dimensions. | |
| If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, **Q6 is half the size and 2× faster**, with no quality loss at those dims. Use Q6 on a 16 GB Mac, BF16 on 32 GB+. | |
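The guidance above can be condensed into a small decision helper. This is only a sketch of the rules stated in this README, not part of the shipped scripts: the function name `pick_variant` and its parameters are hypothetical, the 32 GB threshold is taken from the sentence above, and the helper does not check whether BF16 actually fits in memory.

```python
def pick_variant(ram_gb: float, width: int, height: int,
                 uses_ref_images: bool = False) -> str:
    """Suggest 'bf16' or 'q6' from the README guidance (hypothetical helper)."""
    square = width == height
    # Edit / multi-reference is reported to need BF16, and non-square sizes
    # show grid artifacts under Q6/Q8, so both cases go to BF16.
    if uses_ref_images or not square:
        return "bf16"
    # Square text-to-image: Q6 on RAM-constrained machines, BF16 at 32 GB and up.
    return "bf16" if ram_gb >= 32 else "q6"


print(pick_variant(16, 1024, 1024))   # square T2I on a 16 GB Mac
print(pick_variant(16, 1440, 2560))   # non-square output
```

On a RAM-constrained machine that needs non-square output or reference images, the helper still answers `bf16`, because this README describes no clean quantized alternative for those cases.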
| ## Install | |
| Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio. | |
| ### Quick start (download pre-converted weights, recommended) | |
| ```bash | |
| # Download the repo (code, docs, samples, weights) | |
| hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx | |
| cd hidream-o1-mlx | |
| # Set up the venv | |
| uv venv --python 3.11 | |
| uv pip install -r requirements.txt | |
| # Generate (model files are at the repo root, so pass --model-path .) | |
| .venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \ | |
| --model-path . \ | |
| --prompt "your prompt here" \ | |
| --output out.png | |
| ``` | |
| ### Or convert from upstream weights yourself | |
| ```bash | |
| git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16 | |
| cd HiDream-O1-Image-Dev-mlx-bf16 | |
| uv venv --python 3.11 | |
| uv pip install -r requirements.txt | |
| # Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk) | |
| .venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \ | |
| --hf-source HiDream-ai/HiDream-O1-Image-Dev \ | |
| --out-dir mlx_models/hidream-o1-dev-bf16 \ | |
| --bits 16 | |
| ``` | |
| ## Usage | |
| ```bash | |
| # Single image, default 1024×1024 BF16 | |
| .venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \ | |
| --model-path mlx_models/hidream-o1-dev-bf16 \ | |
| --prompt "your prompt here" \ | |
| --output sample_outputs/whatever.png \ | |
| --seed 42 | |
| # Higher resolution (2048×2048 = upstream default) | |
| .venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \ | |
| --model-path mlx_models/hidream-o1-dev-bf16 \ | |
| --prompt "..." \ | |
| --width 2048 --height 2048 \ | |
| --output sample_outputs/big.png | |
| # Vertical / cinema (auto-snaps to nearest trained ratio) | |
| .venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \ | |
| --model-path mlx_models/hidream-o1-dev-bf16 \ | |
| --prompt "..." \ | |
| --width 1440 --height 2560 \ | |
| --output sample_outputs/portrait.png | |
| # Instruction-based edit (one ref image) | |
| .venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \ | |
| --model-path mlx_models/hidream-o1-dev-bf16 \ | |
| --prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \ | |
| --output sample_outputs/edit_red_jacket.png \ | |
| --ref-images /path/to/chef.jpg \ | |
| --seed 42 | |
| # Multi-reference subject personalization (2-3 refs) | |
| .venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \ | |
| --model-path mlx_models/hidream-o1-dev-bf16 \ | |
| --prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \ | |
| --output sample_outputs/multi_ref.png \ | |
| --ref-images /path/to/person.jpg /path/to/place.jpg \ | |
| --seed 42 | |
| ``` | |
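For batch runs it can be convenient to drive the generator from Python. The sketch below only assembles the command line from the flags shown above (`--model-path`, `--prompt`, `--output`, `--seed`, `--width`, `--height`, `--ref-images`). The helper names `build_command` and `run_seeds` are hypothetical, and the sketch assumes it is run from the repo root with the venv's interpreter.

```python
import subprocess
import sys

SCRIPT = "scripts/hidream_o1/generate_hidream_o1_mlx.py"


def build_command(model_path, prompt, output, seed=None,
                  width=None, height=None, ref_images=()):
    """Assemble the argv list for one generator run (hypothetical helper)."""
    cmd = [sys.executable, SCRIPT,
           "--model-path", model_path,
           "--prompt", prompt,
           "--output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if width is not None and height is not None:
        cmd += ["--width", str(width), "--height", str(height)]
    if ref_images:
        # One --ref-images flag followed by every path, as in the examples above.
        cmd += ["--ref-images", *ref_images]
    return cmd


def run_seeds(model_path, prompt, seeds, out_pattern="sample_outputs/seed_{seed}.png"):
    """Run the generator once per seed and stop on the first failure."""
    for seed in seeds:
        cmd = build_command(model_path, prompt, out_pattern.format(seed=seed), seed=seed)
        subprocess.run(cmd, check=True)
```

Calling `run_seeds("mlx_models/hidream-o1-dev-bf16", "your prompt here", [1, 2, 3])` would then produce one image per seed.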
| ### Trained resolutions | |
| HiDream-O1 was trained on a fixed list of resolutions. The generator auto-snaps to the closest. Off-spec dims produce visible patch artifacts. The trained list: | |
| ``` | |
| 2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560, | |
| 2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304 | |
| ``` | |
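One plausible way to snap a requested size to this list is to pick the entry with the closest aspect ratio. The sketch below is an illustration under that assumption and is not the generator's actual code: `snap_to_trained` is a hypothetical name, and the sketch matches aspect ratio only, so it says nothing about how the generator handles scale (for example the 1024×1024 default).

```python
import math

# Trained resolutions (width, height), copied from the list above.
TRAINED_SIZES = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104), (2304, 1792),
    (1792, 2304),
]


def snap_to_trained(width: int, height: int) -> tuple[int, int]:
    """Return the trained size whose aspect ratio is closest to width:height."""
    target = math.log(width / height)
    # Compare ratios in log space so that 2:1 and 1:2 are treated symmetrically.
    return min(TRAINED_SIZES,
               key=lambda size: abs(math.log(size[0] / size[1]) - target))


print(snap_to_trained(1920, 1080))
```

A 16:9 request such as 1920×1080 lands on the 2560×1440 entry, since both share the same ratio.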
| ## Prompt tips for realism | |
| HiDream is responsive to camera/film terminology. To avoid the AI-glossy look: | |
| - Lead with `masterpiece, best quality` (community-found responder phrase) | |
| - Subject + Actions → Setting → Style → Details ordering | |
| - Specify equipment: `Leica M6 with Kodak Tri-X 400`, `Pentax K1000 + Cinestill 800T`, `Hasselblad H6D medium format` | |
| - Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen | |
| - Spell out skin imperfection: "natural pores", "faint laugh lines", "weathered hands", "no retouching" | |
| - Avoid "stunning", "perfect", "beautiful"; they push toward AI-glamour aesthetics | |
| The Dev model uses `guidance_scale=0.0`, so negative prompts have no effect; push positive prompts harder instead. | |
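The tips above can be turned into a tiny prompt builder. This is a sketch for illustration only: `build_prompt` is a hypothetical helper, not something shipped in this repo. It joins the parts in the recommended order, prepends the quality phrase, and reports any of the discouraged words it finds.

```python
import re

QUALITY_PREFIX = "masterpiece, best quality"
GLAMOUR_WORDS = ("stunning", "perfect", "beautiful")


def build_prompt(subject_action, setting, style, details):
    """Join parts as Subject + Actions, Setting, Style, Details (hypothetical helper).

    Returns (prompt, flagged), where flagged lists discouraged words found.
    """
    parts = [QUALITY_PREFIX, subject_action, setting, style, details]
    prompt = ", ".join(p.strip() for p in parts if p and p.strip())
    # Match whole words only, so "imperfection" does not trigger "perfect".
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    flagged = [w for w in GLAMOUR_WORDS if w in words]
    return prompt, flagged


prompt, flagged = build_prompt(
    "an elderly fisherman mending a net",
    "foggy harbor at dawn",
    "Leica M6 with Kodak Tri-X 400",
    "weathered hands, natural pores, no retouching",
)
print(prompt)
print(flagged)
```

The resulting string can be passed directly to `--prompt`.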
| ## What's in this repo | |
| ``` | |
| hidream-o1-mlx/ | |
| ├── README.md (this file) | |
| ├── LICENSE (MIT) | |
| ├── requirements.txt (mlx-vlm 0.5.0, transformers 5.8+, deps) | |
| ├── scripts/hidream_o1/ | |
| │   ├── convert_hidream_o1_to_mlx.py (HF → MLX, BF16 / Q4 / Q6 / Q8) | |
| │   ├── generate_hidream_o1_mlx.py (T2I generator + experimental edit/multi-ref) | |
| │   ├── hidream_model.py (custom heads + forward_generation) | |
| │   ├── pipeline_helpers.py (T2I sample, mrope, mask, patchify) | |
| │   └── flow_match.py (FlashFlowMatchScheduler in MLX) | |
| ├── docs/ | |
| │   ├── EVALUATION.md (perf + quality findings, A/B vs mflux) | |
| │   ├── HIDREAM_O1_MLX_PORT_REPORT.md (architecture + weight conversion details) | |
| │   └── PHOSPHENE_INTEGRATION_PLAN.md (how it slots into a host app) | |
| ├── sample_outputs/ (gallery) | |
| └── mlx_models/ (where converted weights land) | |
| ``` | |
| ## Performance | |
| | Resolution | Per step | Total (28 steps) | Peak RAM | | |
| |---|---|---|---| | |
| | 1024×1024 | 2.4 s | 67 s | 16 GB | |
| | 1440×2560 | 4.5 s | 127 s | 16 GB | |
| | 2048×2048 | 6.7 s | 187 s | 16 GB | |
| | 3104×1312 | 7.6 s | 213 s | 16 GB | |
| `mx.compile` gives no speedup (0%): the inference loop is bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps. | |
| ## Status | |
| - **Text-to-image**: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac | |
| - **Instruction edit (K=1 ref)**: working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed. | |
| - **Multi-reference subject personalization (K=2-3 refs)**: supported by the upstream architecture and our port; same `--ref-images` flag with multiple paths | |
| - Native MLX: no PyTorch, no CUDA, no flash-attn at inference time | |
| - Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path is fine at all quants. | |
| ## Acknowledgements | |
| - [HiDream-ai](https://github.com/HiDream-ai) for the original HiDream-O1-Image model + MIT license | |
| - [Blaizzy/mlx-vlm](https://github.com/Blaizzy/mlx-vlm) for the Qwen3-VL MLX backbone (this port reuses their vision tower + decoder layers + mrope-3D wholesale) | |
| - [Apple ml-explore/mlx](https://github.com/ml-explore/mlx) for the MLX framework | |
| - The Civitai community's [HiDream prompt-engineering guide](https://civitai.com/articles/16050/hi-dream-prompt-engineering) | |
| ## Citation | |
| If you use this in research, cite the upstream model: | |
| ```bibtex | |
| @misc{hidream-o1-image, | |
| author = {HiDream-ai}, | |
| title = {HiDream-O1-Image: Pixel-Level Unified Transformer}, | |
| year = {2026}, | |
| url = {https://github.com/HiDream-ai/HiDream-O1-Image} | |
| } | |
| ``` | |
| ## License | |
| MIT; see [LICENSE](LICENSE). | |