
HIDREAM-O1-MLX-LAB: STATE

Last updated: 2026-05-09 (session that landed Q8 + Phosphene integration)

TL;DR: where we are

  • Q6 is the new sweet spot. 1.30 s/step at 1024×1024, ~36 s per image, ~8.5 GB RAM. Roughly 2× faster than Q8 (1.8× by the per-step figures) with equivalent quality.
  • Q8 still works (2.36 s/step, 11.5 GB RAM). Keep it for use cases that need a deterministic upper bound on RAM.
  • Q4 deleted from disk: its output comes out dark, so there is no reason to keep it around (regenerable in 5 min if needed).
  • Backbone sizes: Q6 backbone 7.95 GB, Q8 backbone 9.96 GB. Custom heads 75 MB.
  • Shipped to Phosphene dev as kind="hidream" in agent/image_engine.py (commits 45cad69, 962b353). Default model on Phosphene side will be updated to Q6.
  • Showcase battery + A/B vs mflux Z-Image-Turbo done at Q8. At Q6, HiDream is now ~2× faster than Z-Image-Turbo (36 s vs 80 s) AND has a deterministic RAM footprint (8.5 GB, versus Z-Image-Turbo's variable 5.9–29.4 GB).
  • 19+ sample images in sample_outputs/.
  • Lab branch: perf-lab-hidream-o1-mlx, no remote.
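
The per-image times above follow from the per-step times, assuming the 28 denoising steps mentioned later in this file and negligible overhead outside the loop. A small arithmetic sketch (the function name is illustrative, not part of the lab code):

```python
# Illustrative arithmetic only; names here are not from the lab code.
STEPS = 28  # denoising steps, as referenced elsewhere in this file


def seconds_per_image(sec_per_step: float, steps: int = STEPS) -> float:
    """Denoising-loop time only; ignores model load, text encoding, and decode."""
    return sec_per_step * steps


q6 = seconds_per_image(1.30)       # Q6 at 1024x1024
q8 = seconds_per_image(2.36)       # Q8 at 1024x1024
q8_2048 = seconds_per_image(9.86)  # Q8 at 2048x2048

print(f"Q6: {q6:.1f} s, Q8: {q8:.1f} s, Q8/Q6: {q8 / q6:.2f}x")
print(f"Q8 at 2048: {q8_2048:.0f} s")
```

By these figures the Q8/Q6 ratio is closer to 1.8× than 2×, which is why the speedup is described as approximate.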

What's been done

| Date | Work | Commit |
|---|---|---|
| 2026-05-09 | Initial scaffolding (Path B chosen) | 746efe9 |
| 2026-05-09 | Wire mlx-vlm Qwen3VLModel directly (4D mask path) | 53eb605 |
| 2026-05-09 | First working images (mushroom 512, cat/beach/portrait 1024 at Q4) | d944a31 |
| 2026-05-09 | Q8 conversion + samples (dark aesthetic was Q4, not the model) | 2bf029a |
| 2026-05-09 | Showcase battery + evaluation + Phosphene plan | 0bac049 |
| 2026-05-09 | Phosphene integration shipped to dev | phos 45cad69 |
| 2026-05-09 | A/B vs mflux Z-Image-Turbo on 3 prompts | 2761ad8 |
| 2026-05-09 | Phosphene IMAGE_GEN_RESEARCH doc updated | phos 962b353 |
| 2026-05-09 | --blend-seams post-process (opt-in, below-threshold at Q8) | 0583356 |
| 2026-05-09 | Q6 = sweet spot (2× faster than Q8, same quality); Phosphene default switched | f4fb0ba + phos 8a48953 |
| 2026-05-09 | Q6 verified across 10-prompt showcase battery | 4d3f18c |
| 2026-05-09 | Edit/multi-ref scaffold (WIP: runs but output degenerate) | 525b7ec |
| 2026-05-09 | BF16 default: Q4/Q6/Q8 all show patch-grid at non-square dims | (next) |
| 2026-05-09 | Phosphene default switched to BF16 | phos af94bd0 |
| 2026-05-09 | OSS release prep: HF model card, LICENSE, requirements, gitignore | (next) |

Known characteristics (not bugs)

  • Patch grid in flat regions: architectural (PATCH_SIZE=32 with no overlap). Mild at Q8. --blend-seams 1 is opt-in but doesn't visibly help.
  • Text rendering: short, structured signs work ("BLOOM CAFE"). Long text falls apart.
  • Deterministic per-prompt RAM: 11.5 GB at 1024 Q8 regardless of prompt complexity. Z-Image-Turbo varies wildly (5.9–29.4 GB).
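
Because the patches do not overlap, seams can only fall on multiples of PATCH_SIZE. A minimal sketch of where they land for a given image size, assuming dimensions divisible by the patch size (patch_grid is a hypothetical helper, not a function from the lab code):

```python
PATCH_SIZE = 32  # from the note above


def patch_grid(width: int, height: int, patch: int = PATCH_SIZE):
    """Return (cols, rows, vertical seam x-coords, horizontal seam y-coords)."""
    if width % patch or height % patch:
        raise ValueError("dimensions must be multiples of the patch size")
    cols, rows = width // patch, height // patch
    # Interior boundaries between neighbouring patches; image edges are not seams.
    x_seams = [c * patch for c in range(1, cols)]
    y_seams = [r * patch for r in range(1, rows)]
    return cols, rows, x_seams, y_seams


cols, rows, xs, ys = patch_grid(1024, 1024)
print(cols * rows, "patches;", len(xs), "vertical and", len(ys), "horizontal seams")
```

A 512×512 image gives 256 patches under the same assumption, which would be consistent with the 256 target tokens reported in the edit-path shape checks if that smoke test ran at 512×512.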

Open work / next session candidates

Pick from these, listed roughly cheapest-first:

  1. 2048×2048 Q8 generation pass: DONE 2026-05-09. sample_outputs/v4_2048_alice_q8.png: 276 s (9.86 s/step), peak RAM 10.8 GB. Q8 at 2048 is slower per step than Q4 (~10 s vs 8.4 s) due to bandwidth. Output is showcase-grade: detailed cybernetic dress, holographic Cheshire cat, near-legible neon signs.
  2. Test the Phosphene integration through the dev panel UI (port 8199). Generate one shot via the Image Studio dropdown, confirm pill goes green, the PNG lands.
  3. Edit / multi-reference path: SCAFFOLD LANDED, NEEDS DEBUGGING.
    • build_edit_text_sample, resize_pilimage, calculate_dimensions, patchify_ref_image all ported from upstream pipeline.py + utils.py.
    • --ref-images flag wired in generate_hidream_o1_mlx.py.
    • precompute_text_embeds_with_vision precomputes the text+vision embeds once before the loop (since they don't change with timestep), a meaningful perf win.
    • Smoke test (synthesized two-color ref, K=1, 28 steps, Q6) runs end-to-end without errors but output is uniform tan/khaki. T2I path with same prompt+seed produces a vibrant abstract correctly, so the model and weights are fine.
    • Debugging done so far (see scripts/hidream_o1/_edit_diag.py and _precompute_diag.py):
      • All shapes verified correct (input_ids 174 with 144 image-placeholders, vision tower outputs 144 features, vinput_mask = 256 tgt + 256 ref, position_ids 686 covering all spans).
      • Vision feature scatter verified mathematically correct: at image_token positions combined equals image_features exactly (diff=0); at text positions combined equals embed_tokens(input_ids) exactly (diff=0). Vision features are well-behaved (mean ~0, std ~0.4).
      • Position_ids structure looks right: text positions are sequential, target span gets fix_point=4096 base (per upstream), ref diffusion span continues sequentially.
    • Remaining suspects (in order of likelihood):
      • Mask construction: the causal text rows may ALSO need to see the K image_placeholder positions inside proc.input_ids. Upstream _run_decoder_flash has special handling; the non-flash 4D mask path may treat text positions as needing to see embedded vision features. Worth re-reading qwen3_vl_transformers.py:1486-1520.
      • Position_ids semantic alignment: my appended-vinputs at positions [174..686) get mrope codes from input_ids_pad's vision_tokens portion, but maybe these need to match the appended embedding ORDER not just their positions in input_ids_pad.
      • bf16 underflow in attention with the larger 686-token sequence vs T2I's 268.
    • Samples: sample_outputs/v6_edit_smoke.png (degenerate, synthesized 2-color ref), sample_outputs/v6_edit_cat_real.png (degenerate, real cat photo as ref), sample_outputs/v6_edit_t2i_baseline.png (T2I works fine same prompt+seed).
    • This is the single biggest open item. Would let HiDream replace mflux qwen-edit functionally.
  4. Promote Phosphene integration to main after the user has tested on dev panel.
  5. Quality-aware post-process: try a cheap learned upscaler instead of the seam blend (e.g. SeedVR2 via mflux's mflux-upscale-seedvr2 to take 1024 → 2048).
  6. Text-cache reuse across denoising steps β€” fork mlx-vlm's Qwen3VLModel to cache the text-portion KV across the 28 denoising calls. ~2-5% speedup max but a real architectural improvement.
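
For item 3, the vision-feature scatter check described in the debugging notes can be sketched as follows. This is a dependency-free illustration in which plain Python lists stand in for MLX arrays; IMAGE_TOKEN_ID, the toy embedding table, and both function names are assumptions for the example, not the lab's actual code in _edit_diag.py.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, chosen for the example


def scatter_vision_features(input_ids, embed_table, image_features):
    """Replace each image-placeholder embedding with the next vision feature, in order."""
    n_placeholders = sum(1 for t in input_ids if t == IMAGE_TOKEN_ID)
    if n_placeholders != len(image_features):
        raise ValueError("placeholder count must equal the number of vision features")
    feats = iter(image_features)
    return [next(feats) if t == IMAGE_TOKEN_ID else embed_table[t] for t in input_ids]


def verify_scatter(input_ids, embed_table, image_features, combined):
    """Return (max abs diff at image positions, max abs diff at text positions)."""
    feats = iter(image_features)
    img_diff = txt_diff = 0.0
    for tok, row in zip(input_ids, combined):
        expected = next(feats) if tok == IMAGE_TOKEN_ID else embed_table[tok]
        diff = max(abs(x - y) for x, y in zip(row, expected))
        if tok == IMAGE_TOKEN_ID:
            img_diff = max(img_diff, diff)
        else:
            txt_diff = max(txt_diff, diff)
    return img_diff, txt_diff


# Toy data: two text tokens around two image placeholders.
embed_table = {0: [0.1, 0.2], 1: [0.3, 0.4]}
input_ids = [0, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 1]
image_features = [[1.0, 2.0], [3.0, 4.0]]

combined = scatter_vision_features(input_ids, embed_table, image_features)
img_diff, txt_diff = verify_scatter(input_ids, embed_table, image_features, combined)
print("image diff:", img_diff, "text diff:", txt_diff)
```

Both diffs being zero only confirms that the embeddings are placed where the token ids say they should be; it says nothing about the mask or position_ids, which is why those remain the leading suspects. The reported lengths are also self-consistent: 174 input_ids + 256 target + 256 ref = 686 positions.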

Hard stop conditions (still relevant)

  • Q4 ships dark: established. Use Q8.
  • mx.compile = 0% gain: established. Inference loop is at the floor.
  • Splitting safetensors mid-read zeroed weights: fixed in converter; don't re-introduce.

How to ramp up fast (next session)

  1. cd /Users/salo/HIDREAM-O1-MLX-LAB-active
  2. cat README.md CLAUDE.md STATE.md docs/EVALUATION.md (in that order)
  3. git log --oneline | head -10 to see where we are
  4. ls sample_outputs/ to see what's been generated
  5. To regenerate or extend: see the commands in CLAUDE.md

Disk situation snapshot

As of 2026-05-09 the data volume /dev/disk3s5 had ~45 GB free of 926 GB after the user's mid-session cleanup (deleted phosphene-model-lab.git and comfy.git, freed ~83 GB). The lab itself is ~16 GB on disk (10 GB Q8 + 6 GB Q4 models + 1.5 GB venv + samples + lab code). Do not re-download the HiDream HF source unless mlx_models/hidream-o1-dev-q4/ AND mlx_models/hidream-o1-dev-q8/ both go missing; both can be regenerated from one HF download.