# HiDream-O1-Image-Dev (Q8 MLX): evaluation
**Setup:** lab branch `perf-lab-hidream-o1-mlx`, mlx-vlm 0.5.0 + mlx 0.31.2, Mac Studio (64 GB).
**Recipe:** Dev, 28 steps, FlashFlowMatch, `s_noise=7.5`, `noise_clip_std=2.5`, `shift=1.0`.
**All times** are honest wall-clock with `mx.eval` per step. **All RAM** is peak `maximum resident set size`.
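The measurement convention above can be sketched as a small harness. This is a minimal illustration, not the lab script: `time_steps`, `peak_rss_gb`, and the `step`/`sync` callables are hypothetical names. In an MLX pipeline, `sync` would be something like `lambda out: mx.eval(out)`, so that lazy evaluation cannot push work outside the timed region.

```python
import resource
import sys
import time
from typing import Callable, List, Tuple


def peak_rss_gb() -> float:
    """Peak resident set size of this process, in GB (10**9 bytes)."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    bytes_used = raw if sys.platform == "darwin" else raw * 1024
    return bytes_used / 1e9


def time_steps(
    step: Callable[[int], object],
    sync: Callable[[object], None],
    num_steps: int = 28,
) -> Tuple[List[float], float]:
    """Time each step with a forced sync; return (per-step seconds, peak RSS in GB)."""
    per_step = []
    for i in range(num_steps):
        start = time.perf_counter()
        out = step(i)
        sync(out)  # force the step's work to finish before stopping the clock
        per_step.append(time.perf_counter() - start)
    return per_step, peak_rss_gb()


if __name__ == "__main__":
    # Stand-in workload; a real run would call the denoising step here.
    times, peak = time_steps(lambda i: sum(range(10_000)), lambda out: None, num_steps=3)
    print(f"{len(times)} steps, peak RSS {peak:.2f} GB")
```

Peak RSS read this way covers the whole process lifetime, so it should be sampled once at the end of a run rather than per step.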
## Q6 showcase verification (2026-05-09 evening)
Re-ran the same 10-prompt battery at Q6 with identical seeds. **All 10 are visually equivalent or better than the Q8 versions:**
- 9/10 are near-pixel-identical aesthetics (different latent noise from quant differences yields same compositions / lighting / subjects)
- **10 (text rendering) is visibly better at Q6**: the "BLOOM CAFE" neon sign is crisp at Q6 vs a glitched "M" at Q8
Per-image timing was rock-steady at **35.9 s** (1.28 s/step). Total battery time: ~6 minutes vs ~12 minutes at Q8.
Outputs: `sample_outputs/showcase_q6/` (compare against `sample_outputs/showcase/` for the Q8 originals).
## Battery: 10 prompts, 1024×1024, all Q8
| # | Genre | Prompt summary | Result | Time |
|---|---|---|---|---|
| 01 | photo portrait | elderly Japanese tea master | **Excellent**: face character, gentle smile, paper screens, calligraphy | 81.5 s* |
| 02 | anime / illustration | pink-haired girl on Tokyo rooftop at dusk | **Excellent**: anime style + cherry blossoms + neon city below | 65.3 s |
| 03 | macro photo | dewdrop on spiderweb | **Excellent**: refractions, blurred leaf bg, crisp web detail | 65.9 s |
| 04 | architecture | futuristic library, holographic displays | **Excellent**: vaulted ceiling, stained glass, holo screens | 66.3 s |
| 05 | surreal painting | whale floating over desert at sunset | **Excellent**: magical realism, painterly clouds | 65.8 s |
| 06 | food flatlay | rustic Italian breakfast on marble | **Excellent**: golden croissants, espresso, berries, soft light | 66.4 s |
| 07 | cinematic action | samurai mid-leap with katana, Mt. Fuji bg | **Excellent**: dynamic pose, cherry blossoms, real mountain | 66.1 s |
| 08 | fantasy | dragon on crystal mountain with aurora | **Excellent**: iridescent scales, snow swirling, aurora visible | 66.4 s |
| 09 | wildlife photo | snow leopard staring at camera | **Excellent**: direct gaze, falling snow, mountain bg | 67.1 s |
| 10 | text rendering | "BLOOM CAFE" pink neon diner | **Good**: sign legible (small "M" glitch), retro diner, rainy street | 67.1 s |
*Image 01 included cold model load (~12-15 s).
**Steady-state per-image: 65-67 s at 1024×1024 Q8.** Dead-consistent across genres.
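The steady-state figure and the cold-load overhead can be recomputed from the battery table. The sketch below simply copies the reported times; treating image 01 as the only cold run follows the footnote, and the variable names are illustrative.

```python
from statistics import mean

# Per-image wall times in seconds, copied from the battery table.
times_s = {
    1: 81.5, 2: 65.3, 3: 65.9, 4: 66.3, 5: 65.8,
    6: 66.4, 7: 66.1, 8: 66.4, 9: 67.1, 10: 67.1,
}

# Image 01 includes the cold model load, so exclude it from the steady state.
steady = [t for idx, t in times_s.items() if idx != 1]
steady_mean = mean(steady)
cold_overhead = times_s[1] - steady_mean

print(f"steady-state mean: {steady_mean:.1f} s (min {min(steady)}, max {max(steady)})")
print(f"estimated cold-load overhead on image 01: {cold_overhead:.1f} s")
```

The overhead estimate lands at the upper end of the ~12-15 s range quoted in the footnote.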
## Honest timings
| Resolution | Quant | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|---|
| 512×512 | Q4 | 0.89 s | 24.9 s | ~6 GB |
| 1024×1024 | Q4 | 2.37 s | 66 s | ~6 GB |
| 1024×1024 | **Q6** | **1.30 s** | **36 s** | **~8.5 GB** |
| 1024×1024 | Q8 | 2.36 s | 66 s | ~11.5 GB |
| 1280×704 | Q8 | 2.53 s | 70.7 s | ~7 GB |
| 704×1280 | Q8 | 2.35 s | 65.9 s | ~3 GB (warm cache) |
| 2048×2048 | Q4 | 8.44 s | 236 s | ~7.2 GB |
| 2048×2048 | Q8 | 9.86 s | 276 s | ~10.8 GB |
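**Q6 is the sweet spot.** About 1.8× faster than Q8 at 1024 (36 s vs 66 s) with the same prompt fidelity (cat in sunlit kitchen + beach with palm trees both rendered identically to Q8 outputs), and roughly 26% less RAM (~8.5 GB vs ~11.5 GB). The Q8-to-Q6 gain fits the bandwidth-bound theory (fewer bits per param → less weight bandwidth → faster per step), although Q4 at 1024 measured no faster than Q8 in the table above.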
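The speed and RAM comparisons at 1024×1024 can be recomputed directly from the table rows. The values below are copied from the table; the dictionary names are illustrative only.

```python
STEPS = 28  # Dev recipe step count

# 1024x1024 rows from the timings table.
per_step_s = {"Q4": 2.37, "Q6": 1.30, "Q8": 2.36}
peak_ram_gb = {"Q4": 6.0, "Q6": 8.5, "Q8": 11.5}

# Total time per image is per-step time multiplied by the step count.
total_s = {quant: sec * STEPS for quant, sec in per_step_s.items()}

speedup_q6_vs_q8 = per_step_s["Q8"] / per_step_s["Q6"]
ram_saving_q6_vs_q8 = 1 - peak_ram_gb["Q6"] / peak_ram_gb["Q8"]

for quant in ("Q4", "Q6", "Q8"):
    print(f"{quant}: {total_s[quant]:.1f} s per image")
print(f"Q6 vs Q8 speedup: {speedup_q6_vs_q8:.2f}x")
print(f"Q6 vs Q8 RAM saving: {ram_saving_q6_vs_q8:.0%}")
```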
**Q4 corrupts brightness** (ships dark), so the speed of Q4 vs Q6 is academic: never use Q4 for production. Q6 has the speed and Q8 has the steady-state safety; Q6 wins on perf, Q8 wins on a deterministic upper bound on RAM.
## Where HiDream-O1-Image-Dev shines
- **Subject identity**: every prompt subject was rendered correctly. No "vibrant orange tabby" → cat-shape-blob. The model knows what things look like.
- **Multi-element scenes**: samurai + Fuji + cherry blossoms; cyberpunk Alice + neon Cheshire cat + circuit dress + rain. Composition stays coherent.
- **Style adherence**: anime, photorealism, oil painting, macro. Got all four right.
- **Light realism**: the architecture image's light through stained glass; the food flatlay's morning warmth; the action scene's sunset rim lighting. Light feels real, not stamped on.
- **Text rendering** (limited): "BLOOM CAFE" in neon was readable. Better than most diffusion models; not as clean as a model with explicit OCR pretraining.
## Where it's weak
- **Patch-grid artifact** in flat regions. `PATCH_SIZE=32` with no overlap → visible 32×32 grid in skies, water, walls. Most visible in low-frequency content. Architectural: not fixable without retraining or an overlap-blending postprocess.
- **Q4 brightness collapse**: Q4 desaturates and darkens everything. Q8 fixes it. **Ship Q8.**
- **Hands**: hands, when present in scenes (e.g. tea master holding cup), look fine at moderate detail, but the model isn't immune to the standard diffusion hand failure modes; haven't stress-tested.
- **Dense long text**: "BLOOM CAFE" is short and structured. A paragraph of text would likely fall apart.
- **Speed at 2048**: 4 to 4.6 minutes per image (236 s at Q4, 276 s at Q8) is slow for iterative work. Fine for a final pass.
## Sweet spot
**1024×1024, Q6, default Dev recipe, ~36 s/image, ~8.5 GB RAM.** Bright/colourful output equivalent to Q8 in about 55% of the wall time and with roughly 26% less RAM. 512 is fast (~25 s) but loses detail. 2048 is gorgeous but iterative-unfriendly.
**Quant decision tree:**
- 16 GB Mac → don't run HiDream; use mflux Z-Image-Turbo
- 32 GB Mac → Q6 is comfortable, Q8 leaves no headroom alongside LTX
- 64 GB Mac → Q6 default; Q8 only when you want deterministic upper-bound RAM
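The decision tree above could be encoded as a small helper. `pick_hidream_quant` is a hypothetical function, and the handling of in-between memory sizes (anything under 32 GB falls into the 16 GB branch, anything under 64 GB into the 32 GB branch) is an assumption; the tree itself only names the 16, 32, and 64 GB tiers.

```python
from typing import Optional


def pick_hidream_quant(unified_ram_gb: int, need_fixed_ram_bound: bool = False) -> Optional[str]:
    """Return the suggested HiDream quant, or None when HiDream should not be used.

    Tier boundaries between 16/32/64 GB are assumed, not stated in the tree.
    """
    if unified_ram_gb < 32:
        # 16 GB tier: skip HiDream and fall back to mflux Z-Image-Turbo.
        return None
    if unified_ram_gb < 64:
        # 32 GB tier: Q6 is comfortable; Q8 leaves no headroom alongside LTX.
        return "Q6"
    # 64 GB tier: Q6 by default, Q8 only for a deterministic RAM upper bound.
    return "Q8" if need_fixed_ram_bound else "Q6"


if __name__ == "__main__":
    for ram in (16, 32, 64):
        print(ram, "GB ->", pick_hidream_quant(ram))
```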
## A/B vs mflux Z-Image-Turbo
Same prompts, same seeds, both at 1024×1024.
| # | Prompt | HiDream Q8 | Z-Image-Turbo Q4 (mflux) | Subjective winner |
|---|---|---|---|---|
| 1 | tea master | [v3](../sample_outputs/showcase/01_portrait_photo.png): wide scene, paper screens, calligraphy | [zimg](../sample_outputs/ab_mflux/01_portrait_zimage.png): tighter portrait, gray garment, smile | **Tie**: different framings, both excellent |
| 2 | sunlit beach | [v3](../sample_outputs/v3_1024_beach_q8.png): turquoise water, palm trees, beach chair | [zimg](../sample_outputs/ab_mflux/02_beach_zimage.png): vivid blue water, palms, big sand foreground | **Tie**: both nail the prompt |
| 3 | alice cyberpunk | [v3](../sample_outputs/v3_alice_horizontal_q8.png) (horizontal): clear dress + face + Cheshire | [zimg](../sample_outputs/ab_mflux/03_alice_zimage.png): more painterly, atmospheric Cheshire silhouette | **HiDream** for face/dress detail; **Z-Image** for atmosphere |
### Speed + RAM (measured, not estimated)
| Engine | Steps | Wall (1024) | Per step | Peak RAM |
|---|---|---|---|---|
| HiDream-O1-Dev / Q8 | 28 | **67 s** | 2.41 s | **11.5 GB** |
| Z-Image-Turbo / Q4 | 9 | 80 s | 8.85 s | **5.9–29.4 GB** (varies by prompt) |
Surprises:
- HiDream is **faster per image** despite needing 28 steps vs Z-Image-Turbo's 9: Z-Image's per-step cost is ~3.7× HiDream's.
- Z-Image's peak RAM **varied wildly across prompts** (5.9 GB for portrait, 29.4 GB for the alice cyberpunk). HiDream's peak was steady at ~11.5 GB regardless of prompt complexity.
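The first surprise is plain arithmetic: wall time is step count times per-step cost, so fewer steps only win if each step is not proportionally more expensive. A short check using the measured values from the table (names are illustrative):

```python
# (steps, seconds per step) from the Speed + RAM table.
engines = {
    "HiDream-O1-Dev Q8": (28, 2.41),
    "Z-Image-Turbo Q4": (9, 8.85),
}

# Wall time per image = steps * per-step cost.
wall_s = {name: steps * per_step for name, (steps, per_step) in engines.items()}

per_step_ratio = engines["Z-Image-Turbo Q4"][1] / engines["HiDream-O1-Dev Q8"][1]
step_count_ratio = engines["HiDream-O1-Dev Q8"][0] / engines["Z-Image-Turbo Q4"][0]

for name, seconds in wall_s.items():
    print(f"{name}: {seconds:.1f} s")
print(f"per-step cost ratio (Z-Image / HiDream): {per_step_ratio:.2f}x")
print(f"step count ratio (HiDream / Z-Image): {step_count_ratio:.2f}x")
```

HiDream comes out ahead because the per-step cost ratio is larger than the step count ratio.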
### Verdict
Both are excellent local engines. Pick by the workload:
- **Default/compact**: keep **Z-Image-Turbo** (5.9 GB RAM on most prompts, runs anywhere).
- **Hero shots / max prompt fidelity**: **HiDream-O1-Q8** (faster wall time, deterministic memory, more environmental detail in the output).
- **Editing / multi-ref**: keep **mflux qwen-edit**, since the HiDream lab pipeline doesn't support refs yet.
## Patch-grid post-blend experiment
Implemented `--blend-seams <radius>` post-process in `generate_hidream_o1_mlx.py`: after decoding the final image, average a thin band across each 32-pixel patch boundary line (radius=1 → blend the seam row with one neighbour on each side, then 50% blend back into the seam itself).
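A minimal NumPy sketch of a seam blend of this kind is shown below. It is not the code from `generate_hidream_o1_mlx.py`: the function name, the choice of index `32*k` as the seam line, and applying the blend to both rows and columns are all assumptions made for illustration.

```python
import numpy as np

PATCH_SIZE = 32  # patch size stated in this report


def blend_seams(image: np.ndarray, radius: int = 1, patch: int = PATCH_SIZE) -> np.ndarray:
    """Blend each patch-boundary line with the mean of its +/- radius band.

    image: H x W x C uint8 array. Returns a new uint8 array.
    Assumption: the seam line is the first row/column of each patch (index patch*k).
    """
    src = image.astype(np.float32)
    out = src.copy()
    height, width = src.shape[:2]

    # Horizontal seams: average the band of rows around the seam,
    # then mix that average 50/50 back into the seam row.
    for s in range(patch, height, patch):
        lo, hi = max(s - radius, 0), min(s + radius + 1, height)
        band_mean = src[lo:hi].mean(axis=0)
        out[s] = 0.5 * src[s] + 0.5 * band_mean

    # Vertical seams: same operation on columns, applied to the row-blended result.
    rows_done = out.copy()
    for s in range(patch, width, patch):
        lo, hi = max(s - radius, 0), min(s + radius + 1, width)
        band_mean = rows_done[:, lo:hi].mean(axis=1)
        out[:, s] = 0.5 * rows_done[:, s] + 0.5 * band_mean

    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Synthetic image with a hard brightness step exactly on a patch boundary.
    demo = np.zeros((64, 64, 3), dtype=np.uint8)
    demo[32:] = 90
    blended = blend_seams(demo, radius=1)
    print(demo[31:34, 0, 0], "->", blended[31:34, 0, 0])
```

Only the seam lines themselves are rewritten, so the rest of the image is left bit-identical.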
**Result on the same beach prompt + seed 11 + Q8:**
| Comparison | Mean abs diff (out of 255) |
|---|---|
| baseline vs blend r=1 | 0.18 |
| baseline vs blend r=2 | 0.23 |
Per-row breakdown confirms the blend is **surgical**: only seam rows (every 32) change, by 1–2.7 pixel values; non-seam rows shift by <0.2. So the math is doing exactly what it says.
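A per-row breakdown like this can be computed as follows. The helper names are hypothetical and the arrays are synthetic stand-ins; the sketch assumes seam rows sit at multiples of 32 and that both images are H x W x C uint8.

```python
import numpy as np


def per_row_mean_abs_diff(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean absolute difference per image row, on the 0-255 scale."""
    # Widen to int16 first so the uint8 subtraction cannot wrap around.
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return diff.reshape(diff.shape[0], -1).mean(axis=1)


def seam_vs_rest(a: np.ndarray, b: np.ndarray, patch: int = 32) -> tuple:
    """Return (mean diff over seam rows, mean diff over all other rows)."""
    rows = per_row_mean_abs_diff(a, b)
    is_seam = np.zeros(rows.shape[0], dtype=bool)
    is_seam[patch::patch] = True  # rows 32, 64, 96, ...
    return float(rows[is_seam].mean()), float(rows[~is_seam].mean())


if __name__ == "__main__":
    baseline = np.zeros((64, 64, 3), dtype=np.uint8)
    blended = baseline.copy()
    blended[32] = 2  # pretend the blend moved the seam row by 2 levels
    seam, rest = seam_vs_rest(baseline, blended)
    print(f"seam rows: {seam:.2f}, other rows: {rest:.2f}")
```

Averaging the per-row values over the whole image gives the single mean-abs-diff number reported in the comparison table.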
**But visually**: at Q8 the seam artifact is already mild. The blend's 1–2 pixel-value smoothing is below visual threshold. No win, but no harm, and zero added latency (numpy vector ops on a 1024×1024 image are sub-ms).
Bottom line: kept as opt-in flag `--blend-seams 1`. Did not enable by default. The real fix for the patch grid would need overlap-blended patches (architectural change) or a stronger spatial filter (which would visibly blur the image).
## Software-side speed: nothing left
Tested `mx.compile` on the forward pass: **0% improvement** (2.366 s/step compiled vs 2.368 s/step uncompiled). The forward is already bandwidth-bound by the 36-layer Q8 decoder's matmul stream, so MLX is already at near-GPU-saturation. Same conclusion for `mx.fast.scaled_dot_product_attention` (already used inside mlx-vlm's Qwen3VLAttention).
**The path to faster is architectural, not algorithmic:**
- Fewer steps (would need a smaller distillation; Dev is already the distilled variant)
- Smaller backbone (would need re-distillation onto a 4B Qwen3-VL; no public version)
- Caching the text-portion hidden states across denoising steps: possible but invasive (would need to subclass mlx-vlm's Qwen3VLModel; ~2-5% speedup at best since text is <2% of seq length)
## Verdict
- **Working.** Q8 produces real, prompt-faithful, high-quality images at ~67 s/1024.
- **No more easy speedups.** The lab's inference loop is already at the floor for this architecture on this hardware.
- **Patch artifacts are real but mild.** Low-frequency regions show a 32-pixel grid. Subjects-with-content scenes hide it well.
- **Q8 is the only acceptable quant.** Q4 ships dark. If we ever want a smaller variant, would need different bit packing or selective Q6.
## Recommendation for Phosphene
Slot it in as a third local engine alongside `mflux Z-Image-Turbo` (compact tier) and `mflux FLUX.2-klein-4B` (comfortable tier). Mark HiDream as **comfortable+** (32 GB+) due to the 11.5 GB working set. Don't make it the default: it needs more RAM than Z-Image-Turbo on most prompts (11.5 GB vs 5.9 GB), even though its measured wall time at 1024 was lower (67 s vs 80 s). Make it **the option** for users who want max prompt fidelity and license clarity (MIT, no NC restriction).
See [PHOSPHENE_INTEGRATION_PLAN.md](PHOSPHENE_INTEGRATION_PLAN.md) for the patch.