# HIDREAM-O1-MLX-LAB: STATE
**Last updated:** 2026-05-09 (session that landed Q8 + Phosphene integration)
## TL;DR: where we are
- **Q6 is the new sweet spot.** 1.30 s/step at 1024×1024, ~36 s per image, ~8.5 GB RAM. Roughly 2× faster than Q8 with equivalent quality.
- Q8 still works (2.36 s/step, 11.5 GB RAM); keep it for use cases that need a deterministic upper bound on RAM.
- Q4 deleted from disk: ships dark, no reason to keep around (regenerable in 5 min if needed).
- Backbone sizes: Q6 backbone 7.95 GB, Q8 backbone 9.96 GB. Custom heads 75 MB.
- **Shipped to Phosphene `dev`** as `kind="hidream"` in `agent/image_engine.py` (commits `45cad69`, `962b353`). Default model on Phosphene side will be updated to Q6.
- Showcase battery + A/B vs mflux Z-Image-Turbo done at Q8. At Q6, HiDream is now ~2× faster than Z-Image-Turbo (36 s vs 80 s) and has a deterministic, lower worst-case RAM footprint (8.5 GB vs a variable 5.9–29.4 GB).
- 19+ sample images in `sample_outputs/`.
- Lab branch: `perf-lab-hidream-o1-mlx`, **no remote**.
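
As a sanity check on the timings above, per-image denoising time is just step time multiplied by step count. The sketch below assumes 28 steps for every run (the count this file quotes for the edit smoke test and the denoising loop); the dictionary labels and `denoise_seconds` are illustrative names, not lab code.

```python
# Step times (seconds per denoising step) quoted in this file.
STEP_TIME_S = {
    "Q6 @ 1024": 1.30,
    "Q8 @ 1024": 2.36,
    "Q8 @ 2048": 9.86,
}

# Assumption: every run uses 28 denoising steps.
NUM_STEPS = 28


def denoise_seconds(step_time_s: float, num_steps: int = NUM_STEPS) -> float:
    """Time spent in the denoising loop only (excludes model load and save)."""
    return step_time_s * num_steps


for label, step_time in STEP_TIME_S.items():
    print(f"{label}: {denoise_seconds(step_time):.1f} s")

# Per-step ratio between Q8 and Q6 at 1024.
speedup = STEP_TIME_S["Q8 @ 1024"] / STEP_TIME_S["Q6 @ 1024"]
print(f"Q6 vs Q8 per-step speedup: {speedup:.2f}x")
```

Under that assumption the Q6 figure lands near the ~36 s quoted above and the Q8 2048 figure near the 276 s recorded in this file. The per-step Q8/Q6 ratio is about 1.8, which the "2×" shorthand rounds up.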
## What's been done
| Date | Work | Commit |
|---|---|---|
| 2026-05-09 | Initial scaffolding (Path B chosen) | `746efe9` |
| 2026-05-09 | Wire mlx-vlm Qwen3VLModel directly (4D mask path) | `53eb605` |
| 2026-05-09 | First working images (mushroom 512, cat/beach/portrait 1024 at Q4) | `d944a31` |
| 2026-05-09 | Q8 conversion + samples (dark aesthetic was Q4, not the model) | `2bf029a` |
| 2026-05-09 | Showcase battery + evaluation + Phosphene plan | `0bac049` |
| 2026-05-09 | Phosphene integration shipped to `dev` | phos `45cad69` |
| 2026-05-09 | A/B vs mflux Z-Image-Turbo on 3 prompts | `2761ad8` |
| 2026-05-09 | Phosphene IMAGE_GEN_RESEARCH doc updated | phos `962b353` |
| 2026-05-09 | --blend-seams post-process (opt-in, below-threshold at Q8) | `0583356` |
| 2026-05-09 | Q6 = sweet spot (2× faster than Q8, same quality); Phosphene default switched | `f4fb0ba` + phos `8a48953` |
| 2026-05-09 | Q6 verified across 10-prompt showcase battery | `4d3f18c` |
| 2026-05-09 | Edit/multi-ref scaffold (WIP: runs but output degenerate) | `525b7ec` |
| 2026-05-09 | BF16 default: Q4/Q6/Q8 all show patch-grid at non-square dims | (next) |
| 2026-05-09 | Phosphene default switched to BF16 | phos `af94bd0` |
| 2026-05-09 | OSS release prep: HF model card, LICENSE, requirements, gitignore | (next) |
## Known characteristics (not bugs)
- **Patch grid in flat regions**: architectural (PATCH_SIZE=32 with no overlap). Mild at Q8. `--blend-seams 1` is opt-in but doesn't visibly help.
- **Text rendering**: short, structured signs work ("BLOOM CAFE"). Long text falls apart.
- **Deterministic per-prompt RAM**: 11.5 GB at 1024 Q8 regardless of prompt complexity. Z-Image-Turbo varies wildly (5.9–29.4 GB).
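
To make the patch-grid point concrete, here is a minimal sketch of how a non-overlapping grid with PATCH_SIZE=32 tiles an image and where the patch boundaries fall. It assumes image dimensions are multiples of the patch size; `patch_grid` and `seam_positions` are hypothetical helper names for illustration, not functions from the lab code.

```python
PATCH_SIZE = 32  # value quoted in this file


def patch_grid(height: int, width: int, patch: int = PATCH_SIZE):
    """Return (rows, cols) of a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("dimensions must be multiples of the patch size")
    return height // patch, width // patch


def seam_positions(size: int, patch: int = PATCH_SIZE):
    """Pixel offsets where two neighbouring patches meet along one axis."""
    return list(range(patch, size, patch))


rows, cols = patch_grid(1024, 1024)
print(rows, cols, rows * cols)     # grid shape and patch count at 1024x1024
print(len(seam_positions(1024)))   # interior boundaries per axis
```

Because no pixels are shared between neighbouring patches, nothing in the layout itself smooths values across a boundary, which is one plausible reason the grid is most visible in flat regions.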
## Open work / next session candidates
Pick from these, listed roughly cheapest-first:
1. ~~**2048×2048 Q8 generation pass**~~: DONE 2026-05-09. `sample_outputs/v4_2048_alice_q8.png`: 276 s (9.86 s/step), peak RAM 10.8 GB. Q8 at 2048 is slower per step than Q4 (~10 s vs 8.4 s) due to bandwidth. Output is showcase-grade: detailed cybernetic dress, holographic Cheshire cat, near-legible neon signs.
2. **Test the Phosphene integration through the dev panel UI** (port 8199). Generate one shot via the Image Studio dropdown, confirm pill goes green, the PNG lands.
3. **Edit / multi-reference path**: SCAFFOLD LANDED, NEEDS DEBUGGING.
   - `build_edit_text_sample`, `resize_pilimage`, `calculate_dimensions`, `patchify_ref_image` all ported from upstream `pipeline.py` + `utils.py`.
   - `--ref-images` flag wired in `generate_hidream_o1_mlx.py`.
   - `precompute_text_embeds_with_vision` precomputes the text+vision embeds once before the loop (since they don't change with timestep), a meaningful perf win.
   - **Smoke test (synthesized two-color ref, K=1, 28 steps, Q6) runs end-to-end without errors but output is uniform tan/khaki.** T2I path with same prompt+seed produces a vibrant abstract correctly, so the model and weights are fine.
   - Debugging done so far (see `scripts/hidream_o1/_edit_diag.py` and `_precompute_diag.py`):
     - **All shapes verified correct** (input_ids 174 with 144 image-placeholders, vision tower outputs 144 features, vinput_mask = 256 tgt + 256 ref, position_ids 686 covering all spans).
     - **Vision feature scatter verified mathematically correct**: at image_token positions `combined` equals `image_features` exactly (diff=0); at text positions `combined` equals `embed_tokens(input_ids)` exactly (diff=0). Vision features are well-behaved (mean ~0, std ~0.4).
     - **Position_ids structure looks right**: text positions are sequential, target span gets fix_point=4096 base (per upstream), ref diffusion span continues sequentially.
   - **Remaining suspects** (in order of likelihood):
     - Mask construction: maybe text-row causal needs to ALSO see the K image_placeholder positions inside proc.input_ids? Upstream `_run_decoder_flash` has special handling; the non-flash 4D mask path may treat text positions as needing to see embedded vision features. Worth re-reading `qwen3_vl_transformers.py:1486-1520`.
     - Position_ids semantic alignment: my appended-vinputs at positions [174..686) get mrope codes from input_ids_pad's vision_tokens portion, but maybe these need to match the appended embedding ORDER, not just their positions in input_ids_pad.
     - bf16 underflow in attention with the larger 686-token sequence vs T2I's 268.
   - Samples: `sample_outputs/v6_edit_smoke.png` (degenerate, synthesized 2-color ref), `sample_outputs/v6_edit_cat_real.png` (degenerate, real cat photo as ref), `sample_outputs/v6_edit_t2i_baseline.png` (T2I works fine same prompt+seed).
   - This is the single biggest open item. Would let HiDream replace mflux qwen-edit functionally.
4. **Promote Phosphene integration to `main`** after the user has tested on dev panel.
5. **Quality-aware post-process**: try a cheap learned upscaler instead of the seam blend (e.g. SeedVR2 via mflux's `mflux-upscale-seedvr2` to take 1024 → 2048).
6. **Text-cache reuse across denoising steps**: fork mlx-vlm's Qwen3VLModel to cache the text-portion KV across the 28 denoising calls. ~2-5% speedup max but a real architectural improvement.
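
The vision-feature scatter check described under item 3 can be sketched in a few lines. This is a toy, pure-Python illustration of the invariant (placeholder rows equal the image features, all other rows equal the text embeddings), not the code in `_edit_diag.py`; `IMAGE_TOKEN_ID`, `scatter_vision_features` and `max_abs_diff` are hypothetical names, and the embeddings are made-up two-dimensional vectors.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, for illustration only


def scatter_vision_features(input_ids, text_embeds, image_features):
    """Replace placeholder rows of text_embeds with image_features, in order."""
    n_slots = sum(1 for tok in input_ids if tok == IMAGE_TOKEN_ID)
    if n_slots != len(image_features):
        raise ValueError(f"{n_slots} placeholders vs {len(image_features)} features")
    features = iter(image_features)
    return [
        next(features) if tok == IMAGE_TOKEN_ID else row
        for tok, row in zip(input_ids, text_embeds)
    ]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two lists of rows."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


input_ids = [7, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 9]
text_embeds = [[0.1, 0.2], [0.0, 0.0], [0.0, 0.0], [0.3, 0.4]]
image_features = [[1.0, 2.0], [3.0, 4.0]]

combined = scatter_vision_features(input_ids, text_embeds, image_features)

image_rows = [r for t, r in zip(input_ids, combined) if t == IMAGE_TOKEN_ID]
text_rows = [r for t, r in zip(input_ids, combined) if t != IMAGE_TOKEN_ID]
expected_text = [r for t, r in zip(input_ids, text_embeds) if t != IMAGE_TOKEN_ID]

# The two invariants from the debugging notes: diff == 0 on both partitions.
assert max_abs_diff(image_rows, image_features) == 0
assert max_abs_diff(text_rows, expected_text) == 0

# Sequence-length accounting using the numbers recorded above.
TEXT_LEN, TGT_PATCHES, REF_PATCHES = 174, 256, 256
assert TEXT_LEN + TGT_PATCHES + REF_PATCHES == 686
```

A check of this shape passing only shows that the embeddings are placed correctly; it says nothing about the attention mask or the position ids, which is consistent with those remaining on the suspect list.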
## Hard stop conditions (still relevant)
- Q4 ships dark: established. Use Q8.
- mx.compile = 0% gain: established. Inference loop is at the floor.
- Splitting safetensors mid-read zeroed weights: fixed in converter; don't re-introduce.
## How to ramp up fast (next session)
1. `cd /Users/salo/HIDREAM-O1-MLX-LAB-active`
2. `cat README.md CLAUDE.md STATE.md docs/EVALUATION.md` (in that order)
3. `git log --oneline | head -10` to see where we are
4. `ls sample_outputs/` to see what's been generated
5. To regenerate or extend: see the commands in CLAUDE.md
## Disk situation snapshot
As of 2026-05-09 the data volume `/dev/disk3s5` had ~45 GB free of 926 GB after the user's mid-session cleanup (deleted `phosphene-model-lab.git` and `comfy.git`, freed ~83 GB). The lab itself is ~16 GB on disk (10 GB Q8 + 6 GB Q4 models + 1.5 GB venv + samples + lab code). **Do not re-download** the HiDream HF source unless `mlx_models/hidream-o1-dev-q4/` AND `mlx_models/hidream-o1-dev-q8/` both go missing; both can be regenerated from one HF download.