How to use mlx-community/HiDream-O1-Image-Dev-mlx-bf16 with MLX:

```shell
# Download the model from the Hub
# (the extras spec is quoted so that zsh does not glob the brackets)
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir HiDream-O1-Image-Dev-mlx-bf16 mlx-community/HiDream-O1-Image-Dev-mlx-bf16
```
# HiDream-O1-Image-Dev (Q8 MLX): evaluation

**Setup:** lab branch `perf-lab-hidream-o1-mlx`, mlx-vlm 0.5.0 + mlx 0.31.2, Mac Studio (64 GB).

**Recipe:** Dev, 28 steps, FlashFlowMatch, `s_noise=7.5`, `noise_clip_std=2.5`, `shift=1.0`.

**All times** are honest wall-clock, with `mx.eval` called on every step. **All RAM** figures are peak `maximum resident set size`.
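
The measurement method above can be sketched as follows. This is a minimal illustration, not the lab's harness: `time_steps`, `peak_rss_gb`, and the `sync` parameter are hypothetical names. In an MLX loop, `sync` would be `mx.eval`, which forces the lazy graph to run so that each step's time is real work rather than graph construction.

```python
import resource
import sys
import time


def time_steps(step_fn, n_steps, sync=lambda out: None):
    """Return per-step wall-clock seconds, synchronising after every step."""
    per_step = []
    for i in range(n_steps):
        t0 = time.perf_counter()
        out = step_fn(i)
        # With MLX, pass sync=mx.eval; without it, lazy steps would time as ~0.
        sync(out)
        per_step.append(time.perf_counter() - t0)
    return per_step


def peak_rss_gb():
    """Peak resident set size of this process, in GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1e9


if __name__ == "__main__":
    # Stand-in workload; a real run would call one denoising step here.
    times = time_steps(lambda i: sum(range(100_000)), n_steps=28)
    mean_ms = sum(times) / len(times) * 1e3
    print(f"{sum(times):.2f} s total, {mean_ms:.1f} ms/step, peak {peak_rss_gb():.2f} GB")
```

Timing each step separately, rather than the whole loop once, also makes a slow first step (cold load, cache warm-up) visible instead of averaging it away.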
## Q6 showcase verification (2026-05-09 evening)

Re-ran the same 10-prompt battery at Q6 with identical seeds. **All 10 are visually equivalent to or better than the Q8 versions:**

- 9/10 are aesthetically near-identical: quantization shifts the latent noise slightly, but compositions, lighting, and subjects come out the same
- **Prompt 10 (text rendering) is visibly better at Q6**: the "BLOOM CAFE" neon sign is crisp at Q6 vs a glitched "M" at Q8

Per-image timing was rock-steady at **35.9 s** (1.28 s/step). Total battery time: ~6 minutes vs ~12 minutes at Q8.

Outputs: `sample_outputs/showcase_q6/` (compare against `sample_outputs/showcase/` for the Q8 originals).
## Battery: 10 prompts, 1024×1024, all Q8

| # | Genre | Prompt summary | Result | Time |
|---|---|---|---|---|
| 01 | photo portrait | elderly Japanese tea master | **Excellent**: face character, gentle smile, paper screens, calligraphy | 81.5 s\* |
| 02 | anime / illustration | pink-haired girl on Tokyo rooftop at dusk | **Excellent**: anime style + cherry blossoms + neon city below | 65.3 s |
| 03 | macro photo | dewdrop on spiderweb | **Excellent**: refractions, blurred leaf bg, crisp web detail | 65.9 s |
| 04 | architecture | futuristic library, holographic displays | **Excellent**: vaulted ceiling, stained glass, holo screens | 66.3 s |
| 05 | surreal painting | whale floating over desert at sunset | **Excellent**: magical realism, painterly clouds | 65.8 s |
| 06 | food flatlay | rustic Italian breakfast on marble | **Excellent**: golden croissants, espresso, berries, soft light | 66.4 s |
| 07 | cinematic action | samurai mid-leap with katana, Mt. Fuji bg | **Excellent**: dynamic pose, cherry blossoms, real mountain | 66.1 s |
| 08 | fantasy | dragon on crystal mountain with aurora | **Excellent**: iridescent scales, snow swirling, aurora visible | 66.4 s |
| 09 | wildlife photo | snow leopard staring at camera | **Excellent**: direct gaze, falling snow, mountain bg | 67.1 s |
| 10 | text rendering | "BLOOM CAFE" pink neon diner | **Good**: sign legible (small "M" glitch), retro diner, rainy street | 67.1 s |

\* Image 01 included the cold model load (~12-15 s).

**Steady-state per-image: 65-67 s at 1024×1024 Q8.** Dead-consistent across genres.
## Honest timings

| Resolution | Quant | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|---|
| 512×512 | Q4 | 0.89 s | 24.9 s | ~6 GB |
| 1024×1024 | Q4 | 2.37 s | 66 s | ~6 GB |
| 1024×1024 | **Q6** | **1.30 s** | **36 s** | **~8.5 GB** |
| 1024×1024 | Q8 | 2.36 s | 66 s | ~11.5 GB |
| 1280×704 | Q8 | 2.53 s | 70.7 s | ~7 GB |
| 704×1280 | Q8 | 2.35 s | 65.9 s | ~3 GB (warm cache) |
| 2048×2048 | Q4 | 8.44 s | 236 s | ~7.2 GB |
| 2048×2048 | Q8 | 9.86 s | 276 s | ~10.8 GB |

**Q6 is the sweet spot.** About 1.8× faster than Q8 at 1024 (36 s vs 66 s) with the same prompt fidelity (the cat-in-sunlit-kitchen and beach-with-palm-trees prompts both rendered identically to the Q8 outputs), and about 26% less RAM (~8.5 GB vs ~11.5 GB). The bandwidth-bound theory holds: fewer bits per param → less weight bandwidth → faster per-step.

**Q4 corrupts brightness** (ships dark), so the speed of Q4 vs Q6 is academic: never use Q4 for production. Q6 wins on perf; Q8 wins on a deterministic upper bound on RAM.
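
The headline ratios can be recomputed from the 1024×1024 Q6 and Q8 rows of the table above. A quick arithmetic check follows; the dictionary layout and the function name `total_s` are illustrative only.

```python
STEPS = 28  # Dev recipe step count

# Per-step time and peak RAM copied from the 1024x1024 rows of the timing table.
rows = {
    "Q6": {"per_step_s": 1.30, "ram_gb": 8.5},
    "Q8": {"per_step_s": 2.36, "ram_gb": 11.5},
}


def total_s(per_step_s, steps=STEPS):
    """Wall time for one image, ignoring model load and decode overhead."""
    return per_step_s * steps


speedup = rows["Q8"]["per_step_s"] / rows["Q6"]["per_step_s"]
ram_saving = 1 - rows["Q6"]["ram_gb"] / rows["Q8"]["ram_gb"]

for name, row in rows.items():
    print(f"{name}: {total_s(row['per_step_s']):.1f} s per image")
print(f"Q6 speedup over Q8: {speedup:.2f}x, RAM saving: {ram_saving:.0%}")
```

Per-step time multiplied by 28 steps reproduces the table totals to within rounding, and the ratios come out at roughly 1.8× for speed and roughly a quarter for RAM.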
## Where HiDream-O1-Image-Dev shines

- **Subject identity**: every prompt subject was rendered correctly. No "vibrant orange tabby" → cat-shape-blob. The model knows what things look like.
- **Multi-element scenes**: samurai + Fuji + cherry blossoms; cyberpunk Alice + neon Cheshire cat + circuit dress + rain. Composition stays coherent.
- **Style adherence**: anime, photorealism, oil painting, macro. Got all four right.
- **Light realism**: the architecture image's light through stained glass; the food flatlay's morning warmth; the action scene's sunset rim lighting. Light feels real, not stamped on.
- **Text rendering** (limited): "BLOOM CAFE" in neon was readable. Better than most diffusion models; not as clean as a model with explicit OCR pretraining.
## Where it's weak

- **Patch-grid artifact** in flat regions. PATCH_SIZE=32 with no overlap produces a visible 32×32 grid in skies, water, walls. Most visible in low-frequency content. Architectural: not fixable without retraining or an overlap-blending postprocess.
- **Q4 brightness collapse**: Q4 desaturates and darkens everything. Q8 fixes it. **Ship Q8.**
- **Hands**: when present in a scene (e.g. the tea master holding a cup) they look fine at moderate detail, but the model isn't immune to the standard diffusion hand failure modes; not stress-tested yet.
- **Dense long text**: "BLOOM CAFE" is short and structured. A paragraph of text would likely fall apart.
- **Speed at 2048**: roughly 4 to 4.6 minutes per image (236 s at Q4, 276 s at Q8) is slow for iterative work. Fine for a final pass.
## Sweet spot

**1024×1024, Q6, default Dev recipe, ~36 s/image, ~8.5 GB RAM.** Bright/colourful output equivalent to Q8, in roughly half the wall time and with about a quarter less RAM. 512 is fast (~25 s) but loses detail. 2048 is gorgeous but too slow for iteration.

**Quant decision tree:**

- 16 GB Mac: don't run HiDream; use mflux Z-Image-Turbo
- 32 GB Mac: Q6 is comfortable, Q8 leaves no headroom alongside LTX
- 64 GB Mac: Q6 default; Q8 only when you want deterministic upper-bound RAM
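
The decision tree above can be written as a small lookup. This is a sketch only: `pick_engine` and its `need_fixed_ram_ceiling` flag are hypothetical names, and treating the three listed sizes as lower bounds of tiers (so that, say, a 48 GB machine follows the 32 GB rule) is an assumption that the list does not state.

```python
def pick_engine(ram_gb: float, need_fixed_ram_ceiling: bool = False) -> str:
    """Map unified-memory size to the engine/quant suggested by the decision tree."""
    if ram_gb < 32:
        # 16 GB tier: HiDream is not recommended at all.
        return "mflux Z-Image-Turbo"
    if ram_gb >= 64 and need_fixed_ram_ceiling:
        # 64 GB tier: Q8 only when a deterministic RAM upper bound matters.
        return "HiDream Q8"
    # 32 GB tier, and the 64 GB default.
    return "HiDream Q6"


if __name__ == "__main__":
    for ram in (16, 32, 64):
        print(ram, "GB ->", pick_engine(ram))
```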
## A/B vs mflux Z-Image-Turbo

Same prompts, same seeds, both at 1024×1024.

| # | Prompt | HiDream Q8 | Z-Image-Turbo Q4 (mflux) | Subjective winner |
|---|---|---|---|---|
| 1 | tea master | [v3](../sample_outputs/showcase/01_portrait_photo.png): wide scene, paper screens, calligraphy | [zimg](../sample_outputs/ab_mflux/01_portrait_zimage.png): tighter portrait, gray garment, smile | **Tie**: different framings, both excellent |
| 2 | sunlit beach | [v3](../sample_outputs/v3_1024_beach_q8.png): turquoise water, palm trees, beach chair | [zimg](../sample_outputs/ab_mflux/02_beach_zimage.png): vivid blue water, palms, big sand foreground | **Tie**: both nail the prompt |
| 3 | alice cyberpunk | [v3](../sample_outputs/v3_alice_horizontal_q8.png) (horizontal): clear dress + face + Cheshire | [zimg](../sample_outputs/ab_mflux/03_alice_zimage.png): more painterly, atmospheric Cheshire silhouette | **HiDream** for face/dress detail; **Z-Image** for atmosphere |
### Speed + RAM (measured, not estimated)

| Engine | Steps | Wall (1024) | Per step | Peak RAM |
|---|---|---|---|---|
| HiDream-O1-Dev / Q8 | 28 | **67 s** | 2.41 s | **11.5 GB** |
| Z-Image-Turbo / Q4 | 9 | 80 s | 8.85 s | **5.9-29.4 GB** (varies by prompt) |

Surprises:

- HiDream is **faster per image** despite needing 28 steps vs Z-Image-Turbo's 9: Z-Image's per-step cost is ~3.7× HiDream's.
- Z-Image's peak RAM **varied wildly across prompts** (5.9 GB for the portrait, 29.4 GB for the alice cyberpunk). HiDream's peak was steady at ~11.5 GB regardless of prompt complexity.
### Verdict

Both are excellent local engines. Pick by the workload:

- **Default/compact**: keep **Z-Image-Turbo**. 5.9 GB RAM on most prompts, runs anywhere.
- **Hero shots / max prompt fidelity**: **HiDream-O1-Q8**. Faster wall time, deterministic memory, more environmental detail in the output.
- **Editing / multi-ref**: keep **mflux qwen-edit**. The HiDream lab pipeline doesn't support refs yet.
## Patch-grid post-blend experiment

Implemented a `--blend-seams <radius>` post-process in `generate_hidream_o1_mlx.py`: after decoding the final image, average a thin band across each 32-pixel patch boundary line (radius=1: average the seam row with one neighbour on each side, then blend that average 50% back into the seam itself).

**Result on the same beach prompt + seed 11 + Q8:**

| Comparison | Mean abs diff (out of 255) |
|---|---|
| baseline vs blend r=1 | 0.18 |
| baseline vs blend r=2 | 0.23 |

Per-row breakdown confirms the blend is **surgical**: only seam rows (every 32) change, by 1-2.7 pixel values; non-seam rows shift by <0.2. So the math is doing exactly what it says.

**But visually**: at Q8 the seam artifact is already mild. The blend's 1-2 pixel-value smoothing is below the visual threshold. No win, but no harm, and negligible added latency (numpy vector ops on a 1024×1024 image are sub-ms).

Bottom line: kept as opt-in flag `--blend-seams 1`. Did not enable by default. The real fix for the patch grid would need overlap-blended patches (architectural change) or a stronger spatial filter (which would visibly blur the image).
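
For reference, a minimal NumPy sketch of the seam blend as described in this section. It is not the code in `generate_hidream_o1_mlx.py`: the names `blend_seams` and `mean_abs_diff`, the choice to treat seam columns the same way as seam rows, and the rounding back to `uint8` are all assumptions.

```python
import numpy as np


def blend_seams(img, patch=32, radius=1):
    """Soften patch-boundary lines in an H x W (x C) uint8 image.

    Each seam line is replaced by a 50/50 mix of itself and the mean of the
    band spanning `radius` lines on either side of it.
    """
    src = img.astype(np.float32)
    out = src.copy()
    h, w = src.shape[:2]

    # Horizontal seams: rows at multiples of the patch size.
    for y in range(patch, h, patch):
        band = src[max(y - radius, 0):y + radius + 1].mean(axis=0)
        out[y] = 0.5 * src[y] + 0.5 * band

    # Vertical seams: columns at multiples of the patch size (assumed symmetric).
    rows_done = out.copy()
    for x in range(patch, w, patch):
        band = rows_done[:, max(x - radius, 0):x + radius + 1].mean(axis=1)
        out[:, x] = 0.5 * rows_done[:, x] + 0.5 * band

    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def mean_abs_diff(a, b):
    """Mean absolute per-channel difference on the 0-255 scale."""
    return float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))))


if __name__ == "__main__":
    rng = np.random.default_rng(11)
    image = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
    for r in (1, 2):
        blended = blend_seams(image, radius=r)
        print(f"r={r}: mean abs diff {mean_abs_diff(image, blended):.2f}")
```

The demo uses random noise as a stand-in image, so its numbers will not match the table above; the point is that only pixels on seam rows and columns are touched, which is the "surgical" behaviour the per-row breakdown describes.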
## Software-side speed: nothing left

Tested `mx.compile` on the forward pass: **0% improvement** (2.366 s/step compiled vs 2.368 s/step uncompiled). The forward pass is already bandwidth-bound by the 36-layer Q8 decoder's matmul stream, so MLX is already near GPU saturation. Same conclusion for `mx.fast.scaled_dot_product_attention` (already used inside mlx-vlm's Qwen3VLAttention).

**The path to faster is architectural, not algorithmic:**

- Fewer steps (would need a smaller distillation; Dev is already the distilled variant)
- Smaller backbone (would need re-distillation onto a 4B Qwen3-VL; no public version)
- Caching the text-portion hidden states across denoising steps: possible but invasive (would need to subclass mlx-vlm's Qwen3VLModel; ~2-5% speedup at best since text is <2% of seq length)
## Verdict

- **Working.** Q8 produces real, prompt-faithful, high-quality images at ~67 s/1024.
- **No more easy speedups.** The lab's inference loop is already at the floor for this architecture on this hardware.
- **Patch artifacts are real but mild.** Low-frequency regions show a 32-pixel grid. Scenes with detailed subject content hide it well.
- **Q8 is the only acceptable quant.** Q4 ships dark. If we ever want a smaller variant, we would need different bit packing or selective Q6.
## Recommendation for Phosphene

Slot it in as a third local engine alongside `mflux Z-Image-Turbo` (compact tier) and `mflux FLUX.2-klein-4B` (comfortable tier). Mark HiDream as **comfortable+** (32 GB+) due to the 11.5 GB working set. Don't make it the default: its working set is roughly double Z-Image-Turbo's typical 5.9 GB, even though its measured wall time at 1024×1024 was lower (67 s vs 80 s). Make it **the option** for users who want max prompt fidelity and license clarity (MIT, no NC restriction).

See [PHOSPHENE_INTEGRATION_PLAN.md](PHOSPHENE_INTEGRATION_PLAN.md) for the patch.