HIDREAM-O1-MLX-LAB — STATE

Last updated: 2026-05-09 (session that landed Q8 + Phosphene integration)

TL;DR — where we are

Q6 is the new sweet spot. 1.30 s/step at 1024×1024, ~36 s per image, ~8.5 GB RAM. About 1.8× faster than Q8 per step (1.30 vs 2.36 s/step) with equivalent quality.
Q8 still works (2.36 s/step, 11.5 GB RAM) — keep it for deterministic upper-bound RAM use cases.
Q4 deleted from disk: its output comes out dark, so there is no reason to keep it around (regenerable in 5 min if needed).
Backbone sizes: Q6 backbone 7.95 GB, Q8 backbone 9.96 GB. Custom heads 75 MB.
Shipped to Phosphene dev as kind="hidream" in agent/image_engine.py (commits 45cad69, 962b353). The Phosphene default was switched to Q6 (phos 8a48953) and later to BF16 (phos af94bd0).
Showcase battery + A/B vs mflux Z-Image-Turbo done at Q8. At Q6, HiDream is now ~2× faster than Z-Image-Turbo (36 s vs 80 s) and its RAM use is deterministic (8.5 GB, versus 5.9–29.4 GB for Z-Image-Turbo depending on the prompt).
19+ sample images in sample_outputs/.
Lab branch: perf-lab-hidream-o1-mlx, no remote.
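
The per-image figures above follow from the per-step timings. A minimal sketch of that arithmetic, assuming the 28-step schedule used in the runs described elsewhere in this file and ignoring model load and decode overhead (the helper name is illustrative):

```python
# Step count is an assumption taken from the 28-step runs noted in this file.
STEPS = 28

def seconds_per_image(sec_per_step: float, steps: int = STEPS) -> float:
    """Denoising wall-clock time for one image, excluding load/decode overhead."""
    return sec_per_step * steps

q6 = seconds_per_image(1.30)  # Q6 at 1024x1024
q8 = seconds_per_image(2.36)  # Q8 at 1024x1024

print(f"Q6: {q6:.1f} s/image")
print(f"Q8: {q8:.1f} s/image")
print(f"Q8 vs Q6: {q8 / q6:.2f}x")
print(f"Z-Image-Turbo (80 s) vs Q6: {80 / q6:.2f}x")
```

Re-running this with fresh s/step numbers is a quick way to keep the TL;DR ratios honest after a new benchmark pass.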

What's been done

| Date | Work | Commit |
|---|---|---|
| 2026-05-09 | Initial scaffolding (Path B chosen) | `746efe9` |
| 2026-05-09 | Wire mlx-vlm Qwen3VLModel directly (4D mask path) | `53eb605` |
| 2026-05-09 | First working images (mushroom 512, cat/beach/portrait 1024 at Q4) | `d944a31` |
| 2026-05-09 | Q8 conversion + samples (dark aesthetic was Q4, not the model) | `2bf029a` |
| 2026-05-09 | Showcase battery + evaluation + Phosphene plan | `0bac049` |
| 2026-05-09 | Phosphene integration shipped to `dev` | phos `45cad69` |
| 2026-05-09 | A/B vs mflux Z-Image-Turbo on 3 prompts | `2761ad8` |
| 2026-05-09 | Phosphene IMAGE_GEN_RESEARCH doc updated | phos `962b353` |
| 2026-05-09 | --blend-seams post-process (opt-in, below-threshold at Q8) | `0583356` |
| 2026-05-09 | Q6 = sweet spot (2× faster than Q8, same quality); Phosphene default switched | `f4fb0ba` + phos `8a48953` |
| 2026-05-09 | Q6 verified across 10-prompt showcase battery | `4d3f18c` |
| 2026-05-09 | Edit/multi-ref scaffold (WIP: runs but output degenerate) | `525b7ec` |
| 2026-05-09 | BF16 default: Q4/Q6/Q8 all show patch-grid at non-square dims | (next) |
| 2026-05-09 | Phosphene default switched to BF16 | phos `af94bd0` |
| 2026-05-09 | OSS release prep: HF model card, LICENSE, requirements, gitignore | (next) |

Known characteristics (not bugs)

Patch grid in flat regions — architectural (PATCH_SIZE=32 with no overlap). Mild at Q8. --blend-seams 1 is opt-in but doesn't visibly help.
Text rendering — short, structured signs work ("BLOOM CAFE"). Long text falls apart.
Deterministic per-prompt RAM — 11.5 GB at 1024 Q8 regardless of prompt complexity. Z-Image-Turbo varies wildly (5.9–29.4 GB).
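
To make the patch-grid point concrete, here is a minimal sketch of non-overlapping patchification in NumPy. Only PATCH_SIZE=32 comes from this document; the function name `patchify` and the channels-last layout are assumptions for illustration, not the lab's actual code. Because tiles share no pixels, nothing forces adjacent tiles to agree at their borders, which is where seams can show in flat regions.

```python
import numpy as np

PATCH_SIZE = 32  # from this document; the rest of this block is illustrative

def patchify(image: np.ndarray, patch: int = PATCH_SIZE) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, patch * patch * C),
    with tiles ordered row-major over the patch grid.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    tiles = image.reshape(gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return tiles.reshape(gh * gw, patch * patch * c)

tokens_512 = patchify(np.zeros((512, 512, 3), dtype=np.float32))
tokens_1024 = patchify(np.zeros((1024, 1024, 3), dtype=np.float32))
print(tokens_512.shape)   # (256, 3072)
print(tokens_1024.shape)  # (1024, 3072)
```

Under this layout the token count grows with the square of the side length: 256 tokens at 512×512 and 1024 tokens at 1024×1024.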

Open work / next session candidates

Pick from these, listed roughly cheapest-first:

~~2048×2048 Q8 generation pass~~ — DONE 2026-05-09. sample_outputs/v4_2048_alice_q8.png — 276 s (9.86 s/step), peak RAM 10.8 GB. Q8 at 2048 is slower per step than Q4 (10s vs 8.4s) due to bandwidth. Output is showcase-grade: detailed cybernetic dress, holographic Cheshire cat, near-legible neon signs.
Test the Phosphene integration through the dev panel UI (port 8199). Generate one shot via the Image Studio dropdown, then confirm the pill goes green and the PNG lands.
Edit / multi-reference path — SCAFFOLD LANDED, NEEDS DEBUGGING.
- build_edit_text_sample, resize_pilimage, calculate_dimensions, patchify_ref_image all ported from upstream pipeline.py + utils.py.
- --ref-images flag wired in generate_hidream_o1_mlx.py.
- precompute_text_embeds_with_vision precomputes the text+vision embeds once before the loop (since they don't change with timestep) — a meaningful perf win.
- Smoke test (synthesized two-color ref, K=1, 28 steps, Q6) runs end-to-end without errors but output is uniform tan/khaki. T2I path with same prompt+seed produces a vibrant abstract correctly, so the model and weights are fine.
- Debugging done so far (see scripts/hidream_o1/_edit_diag.py and _precompute_diag.py):
  - All shapes verified correct (input_ids 174 with 144 image-placeholders, vision tower outputs 144 features, vinput_mask = 256 tgt + 256 ref, position_ids 686 covering all spans).
  - Vision feature scatter verified mathematically correct — at image_token positions combined equals image_features exactly (diff=0); at text positions combined equals embed_tokens(input_ids) exactly (diff=0). Vision features are well-behaved (mean ~0, std ~0.4).
  - Position_ids structure looks right — text positions are sequential, target span gets fix_point=4096 base (per upstream), ref diffusion span continues sequentially.
- Remaining suspects (in order of likelihood):
  - Mask construction: the causal text rows may ALSO need to see the K image_placeholder positions inside proc.input_ids. Upstream _run_decoder_flash has special handling; the non-flash 4D mask path may treat text positions as needing to see embedded vision features. Worth re-reading qwen3_vl_transformers.py:1486-1520.
  - Position_ids semantic alignment: the appended vinputs at positions [174..686) get their mrope codes from the vision_tokens portion of input_ids_pad, but these may need to match the ORDER of the appended embeddings, not just their positions in input_ids_pad.
  - bf16 underflow in attention with the larger 686-token sequence vs T2I's 268.
- Samples: sample_outputs/v6_edit_smoke.png (degenerate, synthesized 2-color ref), sample_outputs/v6_edit_cat_real.png (degenerate, real cat photo as ref), sample_outputs/v6_edit_t2i_baseline.png (T2I works fine same prompt+seed).
- This is the single biggest open item. Would let HiDream replace mflux qwen-edit functionally.
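
The scatter check described in the debugging notes can be sketched as below. This is a NumPy stand-in, not the code in _edit_diag.py: the placeholder id, the contiguous placeholder span, and the embedding width are all assumptions; only the 174-token sequence and the 144 placeholders come from the notes.

```python
import numpy as np

IMAGE_TOKEN_ID = 9  # hypothetical placeholder id, for illustration only

def scatter_vision_features(input_ids, text_embeds, image_features,
                            image_token_id=IMAGE_TOKEN_ID):
    """Overwrite embeddings at image-placeholder positions with vision features."""
    is_image = input_ids == image_token_id
    if int(is_image.sum()) != image_features.shape[0]:
        raise ValueError("placeholder count does not match vision feature count")
    combined = text_embeds.copy()
    combined[is_image] = image_features  # i-th feature row -> i-th placeholder
    return combined, is_image

rng = np.random.default_rng(0)
seq_len, n_image, dim = 174, 144, 8  # 174 and 144 from the notes; dim is arbitrary
input_ids = np.full(seq_len, 1, dtype=np.int64)
input_ids[10:10 + n_image] = IMAGE_TOKEN_ID  # contiguous span (assumed)
text_embeds = rng.normal(size=(seq_len, dim)).astype(np.float32)
image_features = rng.normal(scale=0.4, size=(n_image, dim)).astype(np.float32)

combined, is_image = scatter_vision_features(input_ids, text_embeds, image_features)

# Both differences should be exactly zero if the scatter is correct.
image_diff = float(np.abs(combined[is_image] - image_features).max())
text_diff = float(np.abs(combined[~is_image] - text_embeds[~is_image]).max())
print(image_diff, text_diff)
```

A check of this shape only proves the embeddings landed in the right rows. It says nothing about the attention mask or position ids, which is why those remain the leading suspects.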
Promote Phosphene integration to main after the user has tested on dev panel.
Quality-aware post-process — try a cheap learned upscaler instead of the seam blend (e.g. SeedVR2 via mflux's mflux-upscale-seedvr2 to take 1024 → 2048).
Text-cache reuse across denoising steps — fork mlx-vlm's Qwen3VLModel to cache the text-portion KV across the 28 denoising calls. ~2-5% speedup max but a real architectural improvement.
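
Both the embed precompute already in the edit path and the proposed text-KV cache rest on the same idea: work that does not depend on the timestep is done once, outside the denoising loop. A toy sketch of that pattern follows; the functions are stand-ins with made-up arithmetic, not mlx-vlm or lab APIs, and a real KV cache would store per-layer keys and values rather than a single tuple.

```python
STEPS = 28  # step count used in the runs described in this file

calls = {"encode": 0, "denoise": 0}

def encode_condition(prompt: str) -> tuple[int, ...]:
    """Stand-in for the expensive text(+vision) encode; independent of timestep."""
    calls["encode"] += 1
    return tuple(ord(ch) for ch in prompt)

def denoise_step(latent: float, cond: tuple[int, ...], t: int) -> float:
    """Stand-in for one denoising call that consumes the cached condition."""
    calls["denoise"] += 1
    return latent - (sum(cond) % 7) * 1e-3 * (t + 1)

def generate(prompt: str, steps: int = STEPS) -> float:
    cond = encode_condition(prompt)  # computed once, outside the loop
    latent = 1.0
    for t in reversed(range(steps)):
        latent = denoise_step(latent, cond, t)
    return latent

generate("a cat on a beach")
print(calls)  # {'encode': 1, 'denoise': 28}
```

The call counter is the useful part: the same instrumentation on the real loop would confirm that the text portion is not being recomputed on every step.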

Hard stop conditions (still relevant)

Q4 ships dark (established). Use Q6 or Q8; BF16 is the current default.
mx.compile = 0% gain — established. Inference loop is at the floor.
Splitting safetensors mid-read zeroed weights — fixed in converter; don't re-introduce.
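
One cheap guard against the zeroed-weights regression is to scan a converted checkpoint for tensors that are entirely zero. The sketch below works on a plain dict of NumPy arrays; the function name and the toy tensor names are hypothetical, and how the weights get loaded into that dict is left out on purpose.

```python
import numpy as np

def find_all_zero_tensors(weights: dict[str, np.ndarray]) -> list[str]:
    """Return the sorted names of non-empty tensors whose every element is zero."""
    return sorted(name for name, w in weights.items()
                  if w.size > 0 and not np.any(w))

# Toy weights standing in for a converted checkpoint (names are made up).
weights = {
    "layers.0.q_proj.weight": np.ones((4, 4), dtype=np.float32),
    "layers.0.k_proj.weight": np.zeros((4, 4), dtype=np.float32),  # simulated corruption
    "norm.bias": np.zeros(4, dtype=np.float32),  # could be legitimately zero
}

suspicious = find_all_zero_tensors(weights)
print(suspicious)  # ['layers.0.k_proj.weight', 'norm.bias']
```

Some tensors, such as biases, can legitimately be all zero, so the result is a list to review rather than a hard failure. A large projection matrix showing up in it is the signal to worry about.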

How to ramp up fast (next session)

`cd /Users/salo/HIDREAM-O1-MLX-LAB-active`
`cat README.md CLAUDE.md STATE.md docs/EVALUATION.md` (in that order)
`git log --oneline | head -10` to see where we are
`ls sample_outputs/` to see what's been generated
To regenerate or extend: see the commands in CLAUDE.md

Disk situation snapshot

As of 2026-05-09 the data volume /dev/disk3s5 had ~45 GB free of 926 GB after the user's mid-session cleanup (deleted phosphene-model-lab.git and comfy.git, freed ~83 GB). The lab itself is ~16 GB on disk (10 GB Q8 + 6 GB Q4 models + 1.5 GB venv + samples + lab code). Do not re-download the HiDream HF source unless mlx_models/hidream-o1-dev-q4/ AND mlx_models/hidream-o1-dev-q8/ both go missing — both can be regenerated from one HF download.
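
The re-download rule above can be expressed as a small pre-flight check. This is a sketch under assumptions: the helper names are hypothetical, and the demo runs against a temporary directory rather than the real mlx_models/ tree.

```python
import shutil
import tempfile
from pathlib import Path

def needs_hf_redownload(model_dirs: list[Path]) -> bool:
    """True only when every converted model directory is missing."""
    return not any(d.is_dir() for d in model_dirs)

def free_gb(path: Path) -> float:
    """Free space on the volume holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mlx_models"
    q4 = root / "hidream-o1-dev-q4"
    q8 = root / "hidream-o1-dev-q8"
    both_missing = needs_hf_redownload([q4, q8])  # nothing exists yet
    q8.mkdir(parents=True)
    one_present = needs_hf_redownload([q4, q8])   # Q8 directory now exists
    free_now = free_gb(Path(tmp))

print(both_missing, one_present)  # True False
```

Checking `free_gb` before any large download is a reasonable extra step given how little headroom the volume had at the time of this snapshot.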