# HIDREAM-O1-MLX-LAB — STATE

**Last updated:** 2026-05-09 (session that landed Q8 + Phosphene integration)

## TL;DR — where we are

- **Q6 is the new sweet spot.** 1.30 s/step at 1024×1024, ~36 s per image, ~8.5 GB RAM. About 1.8× faster than Q8 (1.30 vs 2.36 s/step) with equivalent quality.
- Q8 still works (2.36 s/step, 11.5 GB RAM) — keep it for deterministic upper-bound RAM use cases.
- Q4 deleted from disk: ships dark, no reason to keep around (regenerable in 5 min if needed).
- Backbone sizes: Q6 backbone 7.95 GB, Q8 backbone 9.96 GB. Custom heads 75 MB.
- **Shipped to Phosphene `dev`** as `kind="hidream"` in `agent/image_engine.py` (commits `45cad69`, `962b353`). Default model on Phosphene side will be updated to Q6.
- Showcase battery + A/B vs mflux Z-Image-Turbo done at Q8. At Q6, HiDream is now ~2× faster than Z-Image-Turbo (36 s vs 80 s) and its RAM use is fixed at 8.5 GB, where Z-Image-Turbo varies between 5.9 and 29.4 GB.
- 19+ sample images in `sample_outputs/`.
- Lab branch: `perf-lab-hidream-o1-mlx`, **no remote**.
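
The per-image and speedup figures above follow from the per-step timings. A minimal sketch of that arithmetic, assuming 28 denoising steps per image (the step count used in the runs mentioned elsewhere in this file); the function names are illustrative, not part of the lab code:

```python
# Per-image time and relative speed from the per-step timings in the TL;DR.
# STEPS = 28 is an assumption taken from the 28-step runs mentioned in this file.
STEPS = 28

SEC_PER_STEP = {"q6": 1.30, "q8": 2.36}  # measured at 1024x1024, from the TL;DR


def seconds_per_image(sec_per_step: float, steps: int = STEPS) -> float:
    """Denoising time for one image, ignoring model load and decode overhead."""
    return sec_per_step * steps


def speedup(slow: float, fast: float) -> float:
    """How many times faster `fast` is than `slow` (both in s/step)."""
    return slow / fast


if __name__ == "__main__":
    for name, sps in SEC_PER_STEP.items():
        print(f"{name}: {seconds_per_image(sps):.1f} s per image")
    ratio = speedup(SEC_PER_STEP["q8"], SEC_PER_STEP["q6"])
    print(f"Q6 vs Q8: {ratio:.2f}x")
```

The same helper reproduces the 2048×2048 Q8 figure: 9.86 s/step over 28 steps is about 276 s.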

## What's been done

| Date | Work | Commit |
|---|---|---|
| 2026-05-09 | Initial scaffolding (Path B chosen) | `746efe9` |
| 2026-05-09 | Wire mlx-vlm Qwen3VLModel directly (4D mask path) | `53eb605` |
| 2026-05-09 | First working images (mushroom 512, cat/beach/portrait 1024 at Q4) | `d944a31` |
| 2026-05-09 | Q8 conversion + samples (dark aesthetic was Q4, not the model) | `2bf029a` |
| 2026-05-09 | Showcase battery + evaluation + Phosphene plan | `0bac049` |
| 2026-05-09 | Phosphene integration shipped to `dev` | phos `45cad69` |
| 2026-05-09 | A/B vs mflux Z-Image-Turbo on 3 prompts | `2761ad8` |
| 2026-05-09 | Phosphene IMAGE_GEN_RESEARCH doc updated | phos `962b353` |
| 2026-05-09 | --blend-seams post-process (opt-in, below-threshold at Q8) | `0583356` |
| 2026-05-09 | Q6 = sweet spot (2× faster than Q8, same quality) — Phosphene default switched | `f4fb0ba` + phos `8a48953` |
| 2026-05-09 | Q6 verified across 10-prompt showcase battery | `4d3f18c` |
| 2026-05-09 | Edit/multi-ref scaffold (WIP — runs but output degenerate) | `525b7ec` |
| 2026-05-09 | BF16 default — Q4/Q6/Q8 all show patch-grid at non-square dims | (next) |
| 2026-05-09 | Phosphene default switched to BF16 | phos `af94bd0` |
| 2026-05-09 | OSS release prep: HF model card, LICENSE, requirements, gitignore | (next) |

## Known characteristics (not bugs)

- **Patch grid in flat regions** — architectural (PATCH_SIZE=32 with no overlap). Mild at Q8. `--blend-seams 1` is opt-in but doesn't visibly help.
- **Text rendering** — short, structured signs work ("BLOOM CAFE"). Long text falls apart.
- **Deterministic per-prompt RAM** — 11.5 GB at 1024 Q8 regardless of prompt complexity. Z-Image-Turbo varies wildly (5.9–29.4 GB).
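
To make the patch-grid note concrete, here is a rough sketch of where the seams fall and what a naive seam blend does. It is not the lab's `--blend-seams` implementation, which this file does not describe; `seam_positions` and `blend_vertical_seams` are hypothetical helpers, and a nested list of floats stands in for a real image array.

```python
from typing import List

PATCH_SIZE = 32  # from the note above: patches do not overlap


def seam_positions(size: int, patch: int = PATCH_SIZE) -> List[int]:
    """Interior patch boundaries along one axis (index of the first pixel after each seam)."""
    return list(range(patch, size, patch))


def blend_vertical_seams(
    img: List[List[float]], patch: int = PATCH_SIZE, strength: float = 1.0
) -> List[List[float]]:
    """Pull the two pixels on either side of each vertical seam toward their mean.

    strength=0.0 leaves the image unchanged; strength=1.0 makes both pixels equal.
    """
    out = [row[:] for row in img]
    width = len(img[0]) if img else 0
    for x in seam_positions(width, patch):
        for y, row in enumerate(img):
            left, right = row[x - 1], row[x]
            mean = (left + right) / 2.0
            out[y][x - 1] = left + strength * (mean - left)
            out[y][x] = right + strength * (mean - right)
    return out


if __name__ == "__main__":
    # A 1024-pixel axis has 32 patches and therefore 31 interior seams.
    print(len(seam_positions(1024)))
```

A blend this narrow only touches two pixel columns per seam, which is one plausible reason a seam-only post-process has little visible effect on a grid that comes from the architecture itself.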

## Open work / next session candidates

Pick from these, listed roughly cheapest-first:

1. ~~**2048×2048 Q8 generation pass**~~ — DONE 2026-05-09. `sample_outputs/v4_2048_alice_q8.png` — 276 s (9.86 s/step), peak RAM 10.8 GB. Q8 at 2048 is slower per step than Q4 (10s vs 8.4s) due to bandwidth. Output is showcase-grade: detailed cybernetic dress, holographic Cheshire cat, near-legible neon signs.
2. **Test the Phosphene integration through the dev panel UI** (port 8199). Generate one shot via the Image Studio dropdown, then confirm the pill goes green and the PNG lands.
3. **Edit / multi-reference path** — SCAFFOLD LANDED, NEEDS DEBUGGING.
   - `build_edit_text_sample`, `resize_pilimage`, `calculate_dimensions`, `patchify_ref_image` all ported from upstream pipeline.py + utils.py.
   - `--ref-images` flag wired in generate_hidream_o1_mlx.py.
   - `precompute_text_embeds_with_vision` precomputes the text+vision embeds once before the loop (since they don't change with timestep) — a meaningful perf win.
   - **Smoke test (synthesized two-color ref, K=1, 28 steps, Q6) runs end-to-end without errors but output is uniform tan/khaki.** T2I path with same prompt+seed produces a vibrant abstract correctly, so the model and weights are fine.
   - Debugging done so far (see `scripts/hidream_o1/_edit_diag.py` and `_precompute_diag.py`):
     - **All shapes verified correct** (input_ids 174 with 144 image-placeholders, vision tower outputs 144 features, vinput_mask = 256 tgt + 256 ref, position_ids 686 covering all spans).
     - **Vision feature scatter verified mathematically correct** — at image_token positions `combined` equals `image_features` exactly (diff=0); at text positions `combined` equals `embed_tokens(input_ids)` exactly (diff=0). Vision features are well-behaved (mean ~0, std ~0.4).
     - **Position_ids structure looks right** — text positions are sequential, target span gets fix_point=4096 base (per upstream), ref diffusion span continues sequentially.
   - **Remaining suspects** (in order of likelihood):
     - Mask construction: the causal text rows may ALSO need to attend to the K image_placeholder positions inside `proc.input_ids`. Upstream `_run_decoder_flash` has special handling here, and the non-flash 4D mask path may expect text positions to see the embedded vision features. Worth re-reading `qwen3_vl_transformers.py:1486-1520`.
     - Position_ids semantic alignment: the appended vinputs at positions [174..686) take their mrope codes from the vision_tokens portion of `input_ids_pad`, but those codes may need to follow the ORDER of the appended embeddings, not just their positions in `input_ids_pad`.
     - bf16 underflow in attention with the larger 686-token sequence vs T2I's 268.
   - Samples: `sample_outputs/v6_edit_smoke.png` (degenerate, synthesized 2-color ref), `sample_outputs/v6_edit_cat_real.png` (degenerate, real cat photo as ref), `sample_outputs/v6_edit_t2i_baseline.png` (T2I works fine same prompt+seed).
   - This is the single biggest open item. Would let HiDream replace mflux qwen-edit functionally.
4. **Promote Phosphene integration to `main`** after the user has tested on dev panel.
5. **Quality-aware post-process** — try a cheap learned upscaler instead of the seam blend (e.g. SeedVR2 via mflux's `mflux-upscale-seedvr2` to take 1024 → 2048).
6. **Text-cache reuse across denoising steps** — fork mlx-vlm's Qwen3VLModel to cache the text-portion KV across the 28 denoising calls. ~2-5% speedup max but a real architectural improvement.
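
The scatter check and the sequence-length bookkeeping from item 3 can be sketched in plain Python. This is not the code in `scripts/hidream_o1/_edit_diag.py`: `IMAGE_TOKEN`, `scatter_vision_features`, and `check_scatter` are hypothetical names, and nested lists stand in for MLX arrays.

```python
from typing import List, Sequence, Tuple

IMAGE_TOKEN = -1  # hypothetical placeholder id; the real id comes from the tokenizer

# Lengths reported in item 3: 174 text tokens, 256 target + 256 ref vision inputs.
TEXT_LEN, TGT_LEN, REF_LEN = 174, 256, 256
assert TEXT_LEN + TGT_LEN + REF_LEN == 686  # matches the position_ids length


def scatter_vision_features(
    input_ids: Sequence[int],
    text_embeds: Sequence[List[float]],
    image_features: Sequence[List[float]],
    image_token: int = IMAGE_TOKEN,
) -> List[List[float]]:
    """Replace the embedding at each placeholder position with the next vision feature."""
    n_slots = sum(1 for tok in input_ids if tok == image_token)
    if n_slots != len(image_features):
        raise ValueError(f"{n_slots} placeholders but {len(image_features)} features")
    feats = iter(image_features)
    return [
        next(feats) if tok == image_token else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]


def _max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def check_scatter(
    input_ids: Sequence[int],
    text_embeds: Sequence[List[float]],
    image_features: Sequence[List[float]],
    combined: Sequence[List[float]],
    image_token: int = IMAGE_TOKEN,
) -> Tuple[float, float]:
    """Return (max diff at image positions, max diff at text positions)."""
    feats = iter(image_features)
    image_diff = text_diff = 0.0
    for tok, emb, got in zip(input_ids, text_embeds, combined):
        if tok == image_token:
            image_diff = max(image_diff, _max_abs_diff(got, next(feats)))
        else:
            text_diff = max(text_diff, _max_abs_diff(got, emb))
    return image_diff, text_diff
```

Both diffs coming back as exactly 0 is what "verified mathematically correct" means above. It rules out the scatter itself, but says nothing about the mask or the position codes, which is why those remain the main suspects.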

## Hard stop conditions (still relevant)

- Q4 ships dark (established). Use Q6 or Q8.
- mx.compile = 0% gain — established. Inference loop is at the floor.
- Splitting safetensors mid-read zeroed weights — fixed in converter; don't re-introduce.
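
One cheap guard against the zeroed-weights regression is to scan converted weights for tensors that are entirely zero. The sketch below is an assumption about how such a check could look, not the fix that landed in the converter; `find_all_zero_tensors` is a hypothetical helper, and flat Python sequences stand in for real weight arrays.

```python
from typing import Dict, List, Sequence


def find_all_zero_tensors(tensors: Dict[str, Sequence[float]]) -> List[str]:
    """Return the sorted names of non-empty tensors whose values are all exactly zero."""
    zeroed = []
    for name, values in tensors.items():
        values = list(values)
        # Empty tensors are skipped: they carry no evidence either way.
        if values and all(v == 0 for v in values):
            zeroed.append(name)
    return sorted(zeroed)


if __name__ == "__main__":
    weights = {
        "layer0.weight": [0.12, -0.03, 0.0],
        "layer1.weight": [0.0, 0.0, 0.0],  # suspicious: fully zeroed
    }
    print(find_all_zero_tensors(weights))
```

Some tensors can legitimately be all zeros (a zero-initialized bias, for example), so a hit is a prompt to inspect, not proof of corruption.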

## How to ramp up fast (next session)

1. `cd /Users/salo/HIDREAM-O1-MLX-LAB-active`
2. `cat README.md CLAUDE.md STATE.md docs/EVALUATION.md` (in that order)
3. `git log --oneline | head -10` to see where we are
4. `ls sample_outputs/` to see what's been generated
5. To regenerate or extend: see the commands in CLAUDE.md

## Disk situation snapshot

As of 2026-05-09 the data volume `/dev/disk3s5` had ~45 GB free of 926 GB after the user's mid-session cleanup (deleted `phosphene-model-lab.git` and `comfy.git`, freed ~83 GB). The lab itself is ~16 GB on disk (10 GB Q8 + 6 GB Q4 models + 1.5 GB venv + samples + lab code). **Do not re-download** the HiDream HF source unless `mlx_models/hidream-o1-dev-q4/` AND `mlx_models/hidream-o1-dev-q8/` both go missing — both can be regenerated from one HF download.
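
The re-download rule above can be expressed as a small pre-flight check. This is a sketch under the assumption that the two model directories named above sit under the lab root; `should_redownload` and `free_gb` are hypothetical helpers, not existing lab scripts.

```python
import shutil
from pathlib import Path
from typing import Iterable

# Directory names taken from the paragraph above.
MODEL_DIRS = ("mlx_models/hidream-o1-dev-q4", "mlx_models/hidream-o1-dev-q8")


def should_redownload(lab_root: str, model_dirs: Iterable[str] = MODEL_DIRS) -> bool:
    """True only when every listed model directory is missing."""
    root = Path(lab_root)
    return all(not (root / d).is_dir() for d in model_dirs)


def free_gb(path: str = ".") -> float:
    """Free space on the volume holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print(f"free: {free_gb():.1f} GB, re-download needed: {should_redownload('.')}")
```

Checking `free_gb` before any conversion or download is worthwhile given how little headroom the volume had at the time of this snapshot.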