How to use mlx-community/HiDream-O1-Image-Dev-mlx-bf16 with MLX:
```
# Download the model from the Hub
pip install huggingface_hub[hf_xet]
huggingface-cli download --local-dir HiDream-O1-Image-Dev-mlx-bf16 mlx-community/HiDream-O1-Image-Dev-mlx-bf16
```
HIDREAM-O1-MLX-LAB: STATE
Last updated: 2026-05-09 (session that landed Q8 + Phosphene integration)
TL;DR: where we are
- Q6 is the new sweet spot. 1.30 s/step at 1024×1024, ~36 s per image, ~8.5 GB RAM. Roughly 2× faster than Q8 with equivalent quality.
- Q8 still works (2.36 s/step, 11.5 GB RAM); keep it for deterministic upper-bound RAM use cases.
- Q4 deleted from disk: ships dark, no reason to keep around (regenerable in 5 min if needed).
- Backbone sizes: Q6 backbone 7.95 GB, Q8 backbone 9.96 GB. Custom heads 75 MB.
- Shipped to Phosphene `dev` as `kind="hidream"` in `agent/image_engine.py` (commits `45cad69`, `962b353`). Default model on Phosphene side will be updated to Q6.
- Showcase battery + A/B vs mflux Z-Image-Turbo done at Q8. At Q6, HiDream is now ~2× faster than Z-Image-Turbo (36 s vs 80 s) and has lower, deterministic RAM use (8.5 GB vs a variable 5.9-29.4 GB).
- 19+ sample images in `sample_outputs/`.
- Lab branch: `perf-lab-hidream-o1-mlx`, no remote.
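
The per-image times quoted above follow from step time multiplied by step count. A minimal sketch of that arithmetic, assuming the 28-step schedule used elsewhere in these notes; `seconds_per_image` is an illustrative helper, not part of the lab code, and the estimate ignores fixed overhead such as model load and final decode:

```python
# Illustrative helper (not part of the lab code): estimate wall-clock time
# per image from a measured step time. Assumes 28 denoising steps.
def seconds_per_image(seconds_per_step: float, steps: int = 28) -> float:
    return seconds_per_step * steps


q6 = seconds_per_image(1.30)  # Q6 step time at 1024x1024, from the notes
q8 = seconds_per_image(2.36)  # Q8 step time at 1024x1024, from the notes
print(f"Q6: {q6:.1f} s, Q8: {q8:.1f} s, Q8/Q6 ratio: {q8 / q6:.2f}x")
```

The same helper reproduces the 2048×2048 figure: 9.86 s/step over 28 steps is about 276 s.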
What's been done
| Date | Work | Commit |
|---|---|---|
| 2026-05-09 | Initial scaffolding (Path B chosen) | 746efe9 |
| 2026-05-09 | Wire mlx-vlm Qwen3VLModel directly (4D mask path) | 53eb605 |
| 2026-05-09 | First working images (mushroom 512, cat/beach/portrait 1024 at Q4) | d944a31 |
| 2026-05-09 | Q8 conversion + samples (dark aesthetic was Q4, not the model) | 2bf029a |
| 2026-05-09 | Showcase battery + evaluation + Phosphene plan | 0bac049 |
| 2026-05-09 | Phosphene integration shipped to dev | phos 45cad69 |
| 2026-05-09 | A/B vs mflux Z-Image-Turbo on 3 prompts | 2761ad8 |
| 2026-05-09 | Phosphene IMAGE_GEN_RESEARCH doc updated | phos 962b353 |
| 2026-05-09 | --blend-seams post-process (opt-in, below-threshold at Q8) | 0583356 |
| 2026-05-09 | Q6 = sweet spot (2× faster than Q8, same quality); Phosphene default switched | f4fb0ba + phos 8a48953 |
| 2026-05-09 | Q6 verified across 10-prompt showcase battery | 4d3f18c |
| 2026-05-09 | Edit/multi-ref scaffold (WIP: runs but output degenerate) | 525b7ec |
| 2026-05-09 | BF16 default: Q4/Q6/Q8 all show patch-grid at non-square dims | (next) |
| 2026-05-09 | Phosphene default switched to BF16 | phos af94bd0 |
| 2026-05-09 | OSS release prep: HF model card, LICENSE, requirements, gitignore | (next) |
Known characteristics (not bugs)
- Patch grid in flat regions: architectural (PATCH_SIZE=32 with no overlap). Mild at Q8. `--blend-seams 1` is opt-in but doesn't visibly help.
- Text rendering: short, structured signs work ("BLOOM CAFE"). Long text falls apart.
- Deterministic per-prompt RAM: 11.5 GB at 1024 Q8 regardless of prompt complexity. Z-Image-Turbo varies wildly (5.9-29.4 GB).
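
Because patches are tiled without overlap, seams can only fall on multiples of the patch size. A small sketch that lists where they land along one axis, assuming PATCH_SIZE=32 as stated above; `seam_positions` is a hypothetical helper for illustration and is not the lab's `--blend-seams` implementation:

```python
PATCH_SIZE = 32  # value taken from the notes above


def seam_positions(size: int, patch: int = PATCH_SIZE) -> list[int]:
    """Pixel offsets of the interior patch boundaries along one image axis."""
    return list(range(patch, size, patch))


seams_1024 = seam_positions(1024)
print(len(seams_1024), seams_1024[:3], seams_1024[-1])
```

Knowing the seam offsets makes it easier to check whether a visible grid in a flat region actually lines up with patch boundaries.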
Open work / next session candidates
Pick from these, listed roughly cheapest-first:
- 2048×2048 Q8 generation pass: DONE 2026-05-09. `sample_outputs/v4_2048_alice_q8.png`: 276 s (9.86 s/step), peak RAM 10.8 GB. Q8 at 2048 is slower per step than Q4 (10 s vs 8.4 s) due to bandwidth. Output is showcase-grade: detailed cybernetic dress, holographic Cheshire cat, near-legible neon signs.
- Test the Phosphene integration through the dev panel UI (port 8199). Generate one shot via the Image Studio dropdown, confirm the pill goes green and the PNG lands.
- Edit / multi-reference path: SCAFFOLD LANDED, NEEDS DEBUGGING.
  - `build_edit_text_sample`, `resize_pilimage`, `calculate_dimensions`, `patchify_ref_image` all ported from upstream pipeline.py + utils.py.
  - `--ref-images` flag wired in generate_hidream_o1_mlx.py.
  - `precompute_text_embeds_with_vision` precomputes the text+vision embeds once before the loop (since they don't change with timestep), a meaningful perf win.
  - Smoke test (synthesized two-color ref, K=1, 28 steps, Q6) runs end-to-end without errors but output is uniform tan/khaki. T2I path with same prompt+seed produces a vibrant abstract correctly, so the model and weights are fine.
  - Debugging done so far (see `scripts/hidream_o1/_edit_diag.py` and `_precompute_diag.py`):
    - All shapes verified correct (input_ids 174 with 144 image-placeholders, vision tower outputs 144 features, vinput_mask = 256 tgt + 256 ref, position_ids 686 covering all spans).
    - Vision feature scatter verified mathematically correct: at image_token positions `combined` equals `image_features` exactly (diff=0); at text positions `combined` equals `embed_tokens(input_ids)` exactly (diff=0). Vision features are well-behaved (mean ~0, std ~0.4).
    - Position_ids structure looks right: text positions are sequential, target span gets fix_point=4096 base (per upstream), ref diffusion span continues sequentially.
  - Remaining suspects (in order of likelihood):
    - Mask construction: maybe text-row causal needs to ALSO see the K image_placeholder positions inside proc.input_ids? Upstream `_run_decoder_flash` has special handling; the non-flash 4D mask path may treat text positions as needing to see embedded vision features. Worth re-reading qwen3_vl_transformers.py:1486-1520.
    - Position_ids semantic alignment: my appended-vinputs at positions [174..686) get mrope codes from input_ids_pad's vision_tokens portion, but maybe these need to match the appended embedding ORDER, not just their positions in input_ids_pad.
    - bf16 underflow in attention with the larger 686-token sequence vs T2I's 268.
  - Samples: `sample_outputs/v6_edit_smoke.png` (degenerate, synthesized 2-color ref), `sample_outputs/v6_edit_cat_real.png` (degenerate, real cat photo as ref), `sample_outputs/v6_edit_t2i_baseline.png` (T2I works fine with same prompt+seed).
  - This is the single biggest open item. Would let HiDream replace mflux qwen-edit functionally.
- Promote Phosphene integration to `main` after the user has tested on dev panel.
- Quality-aware post-process: try a cheap learned upscaler instead of the seam blend (e.g. SeedVR2 via mflux's `mflux-upscale-seedvr2` to take 1024 → 2048).
- Text-cache reuse across denoising steps: fork mlx-vlm's Qwen3VLModel to cache the text-portion KV across the 28 denoising calls. ~2-5% speedup max but a real architectural improvement.
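
The vision-feature scatter check from the edit-path debugging notes can be expressed in a few lines. This is a pure-Python sketch on toy data, not the code in `_edit_diag.py`; the placeholder id, the toy embedding rows, and the function names are all assumptions made for illustration:

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, for illustration only


def scatter_vision_features(input_ids, text_embeds, image_features):
    """Replace each image-placeholder row with the next vision feature, in order."""
    n_placeholders = sum(1 for t in input_ids if t == IMAGE_TOKEN_ID)
    if n_placeholders != len(image_features):
        raise ValueError("placeholder count must match number of vision features")
    features = iter(image_features)
    return [next(features) if t == IMAGE_TOKEN_ID else row
            for t, row in zip(input_ids, text_embeds)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length rows."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy data: 5 tokens, 2 of them image placeholders, embedding width 3.
input_ids = [7, IMAGE_TOKEN_ID, 3, IMAGE_TOKEN_ID, 9]
text_embeds = [[float(t)] * 3 for t in input_ids]  # stand-in for embed_tokens(input_ids)
image_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

combined = scatter_vision_features(input_ids, text_embeds, image_features)

# The two invariants from the notes: exact match at image and at text positions.
image_rows = [row for t, row in zip(input_ids, combined) if t == IMAGE_TOKEN_ID]
image_diff = max(max_abs_diff(a, b) for a, b in zip(image_rows, image_features))
text_diff = max(max_abs_diff(c, e)
                for t, c, e in zip(input_ids, combined, text_embeds)
                if t != IMAGE_TOKEN_ID)
print(image_diff, text_diff)
```

Passing this check only shows that the scatter itself is correct; it says nothing about the mask or position-id suspects listed above.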
Hard stop conditions (still relevant)
- Q4 ships dark: established. Use Q8.
- mx.compile = 0% gain: established. Inference loop is at the floor.
- Splitting safetensors mid-read zeroed weights: fixed in converter; don't re-introduce.
How to ramp up fast (next session)
- `cd /Users/salo/HIDREAM-O1-MLX-LAB-active`
- `cat README.md CLAUDE.md STATE.md docs/EVALUATION.md` (in that order)
- `git log --oneline | head -10` to see where we are
- `ls sample_outputs/` to see what's been generated
- To regenerate or extend: see the commands in CLAUDE.md
Disk situation snapshot
As of 2026-05-09 the data volume /dev/disk3s5 had ~45 GB free of 926 GB after the user's mid-session cleanup (deleted phosphene-model-lab.git and comfy.git, freed ~83 GB). The lab itself is ~16 GB on disk (10 GB Q8 + 6 GB Q4 models + 1.5 GB venv + samples + lab code). Do not re-download the HiDream HF source unless `mlx_models/hidream-o1-dev-q4/` AND `mlx_models/hidream-o1-dev-q8/` both go missing; both can be regenerated from one HF download.
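
Given how tight the volume is, it may be worth checking free space before any download or conversion. A minimal sketch using the standard library's `shutil.disk_usage`; the helper names and the 5 GB safety margin are assumptions, and the required size has to be supplied by whoever runs it:

```python
import shutil


def free_gb(path: str = "/") -> float:
    """Free space on the volume containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room(required_gb: float, path: str = "/", margin_gb: float = 5.0) -> bool:
    """True if the volume has at least required_gb plus a safety margin free."""
    return free_gb(path) >= required_gb + margin_gb


print(f"{free_gb('.'):.1f} GB free on the current volume")
```

Calling `has_room(required_gb)` before a download gives an early, cheap failure instead of a half-written model directory.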