	# HiDream-O1-Image-Dev (Q8 MLX) — evaluation

	Setup: lab branch `perf-lab-hidream-o1-mlx`, mlx-vlm 0.5.0 + mlx 0.31.2, Mac Studio (64 GB).
	Recipe: Dev — 28 steps, FlashFlowMatch, `s_noise=7.5`, `noise_clip_std=2.5`, `shift=1.0`.
	All times are honest wall-clock with `mx.eval` per step. All RAM is peak `maximum resident set size`.
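A minimal sketch of the measurement method described above, assuming a per-step callable. `time_steps`, `peak_rss_gb`, and the stand-in workload are hypothetical names for illustration, not code from the lab branch. With MLX, the `sync` argument would be `mx.eval`, because MLX evaluates lazily and an unsynchronized timer would stop before the step has actually run. Reading peak RSS through `resource.getrusage` is also an assumption; the write-up only says the figure is `maximum resident set size`.

```python
import resource
import sys
import time


def time_steps(step_fn, num_steps, sync=lambda out: out):
    """Return per-step wall-clock seconds, forcing each step to finish before the clock stops."""
    per_step = []
    for i in range(num_steps):
        start = time.perf_counter()
        out = step_fn(i)
        sync(out)  # with MLX this would be mx.eval(out); the default is a no-op
        per_step.append(time.perf_counter() - start)
    return per_step


def peak_rss_gb():
    """Peak resident set size of this process in decimal GB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return rss * scale / 1e9


if __name__ == "__main__":
    # Stand-in workload; a real run would call the denoising step here.
    times = time_steps(lambda i: sum(range(100_000)), num_steps=28)
    mean = sum(times) / len(times)
    print(f"{mean:.4f} s/step, {sum(times):.2f} s total, peak {peak_rss_gb():.2f} GB")
```

Timing each step separately also makes a one-off cost such as a cold model load show up as an outlier rather than inflating the mean.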

	## Q6 showcase verification (2026-05-09 evening)

	Re-ran the same 10-prompt battery at Q6 with identical seeds. All 10 are visually equivalent to or better than the Q8 versions:

	- 9 of 10 are near-identical in look: quantization perturbs the latents slightly, but compositions, lighting, and subjects come out the same.
	- Image 10 (text rendering) is visibly better at Q6: the "BLOOM CAFE" neon sign is crisp at Q6, versus a glitched "M" at Q8.

	Per-image timing was rock-steady at 35.9 s (1.28 s/step). Total battery time: ~6 minutes vs ~12 minutes at Q8.

	Outputs: `sample_outputs/showcase_q6/` (compare against `sample_outputs/showcase/` for the Q8 originals).

	## Battery: 10 prompts, 1024×1024, all Q8

	| # | Genre | Prompt summary | Result | Time |
	|---|---|---|---|---|
	| 01 | photo portrait | elderly Japanese tea master | Excellent: face character, gentle smile, paper screens, calligraphy | 81.5 s* |
	| 02 | anime / illustration | pink-haired girl on Tokyo rooftop at dusk | Excellent: anime style + cherry blossoms + neon city below | 65.3 s |
	| 03 | macro photo | dewdrop on spiderweb | Excellent: refractions, blurred leaf bg, crisp web detail | 65.9 s |
	| 04 | architecture | futuristic library, holographic displays | Excellent: vaulted ceiling, stained glass, holo screens | 66.3 s |
	| 05 | surreal painting | whale floating over desert at sunset | Excellent: magical realism, painterly clouds | 65.8 s |
	| 06 | food flatlay | rustic Italian breakfast on marble | Excellent: golden croissants, espresso, berries, soft light | 66.4 s |
	| 07 | cinematic action | samurai mid-leap with katana, Mt. Fuji bg | Excellent: dynamic pose, cherry blossoms, real mountain | 66.1 s |
	| 08 | fantasy | dragon on crystal mountain with aurora | Excellent: iridescent scales, snow swirling, aurora visible | 66.4 s |
	| 09 | wildlife photo | snow leopard staring at camera | Excellent: direct gaze, falling snow, mountain bg | 67.1 s |
	| 10 | text rendering | "BLOOM CAFE" pink neon diner | Good: sign legible (small "M" glitch), retro diner, rainy street | 67.1 s |

	*Image 01 included cold model load (~12-15 s).

	Steady-state per-image: 65-67 s at 1024×1024 Q8. Dead-consistent across genres.

	## Honest timings

	| Resolution | Quant | Per step | Total (28 steps) | Peak RAM |
	|---|---|---|---|---|
	| 512×512 | Q4 | 0.89 s | 24.9 s | ~6 GB |
	| 1024×1024 | Q4 | 2.37 s | 66 s | ~6 GB |
	| 1024×1024 | Q6 | 1.30 s | 36 s | ~8.5 GB |
	| 1024×1024 | Q8 | 2.36 s | 66 s | ~11.5 GB |
	| 1280×704 | Q8 | 2.53 s | 70.7 s | ~7 GB |
	| 704×1280 | Q8 | 2.35 s | 65.9 s | ~3 GB (warm cache) |
	| 2048×2048 | Q4 | 8.44 s | 236 s | ~7.2 GB |
	| 2048×2048 | Q8 | 9.86 s | 276 s | ~10.8 GB |

	Q6 is the sweet spot. It is about 1.8× faster than Q8 at 1024 (36 s vs 66 s) with the same prompt fidelity: the cat-in-sunlit-kitchen and beach-with-palm-trees prompts both rendered identically to the Q8 outputs. Peak RAM is about 26% lower (~8.5 GB vs ~11.5 GB). The Q6-vs-Q8 gap is consistent with the bandwidth-bound theory (fewer bits per param → less weight bandwidth → faster per-step), although the 1024×1024 Q4 row (2.37 s/step) does not follow the same trend.

	Q4 corrupts brightness (outputs ship dark), so its speed relative to Q6 is academic: never use Q4 for production. Q6 wins on performance; Q8 wins on a deterministic upper bound on RAM.
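The Q6-versus-Q8 comparison can be checked directly from the table. The short calculation below uses only the two 1024×1024 rows and the 28-step recipe; the variable names are illustrative.

```python
STEPS = 28

# (seconds per step, approximate peak RAM in GB), copied from the 1024x1024 rows of the table
q6_step, q6_ram = 1.30, 8.5
q8_step, q8_ram = 2.36, 11.5

q6_total = q6_step * STEPS
q8_total = q8_step * STEPS
speedup = q8_total / q6_total
ram_saving = 1 - q6_ram / q8_ram

print(f"Q6 total: {q6_total:.1f} s, Q8 total: {q8_total:.1f} s")  # Q6 total: 36.4 s, Q8 total: 66.1 s
print(f"speedup: {speedup:.2f}x, RAM saving: {ram_saving:.0%}")    # speedup: 1.82x, RAM saving: 26%
```

So "half the wall time" is a fair rounding of a 1.8× speedup, and the RAM saving is closer to a quarter than a third.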

	## Where HiDream-O1-Image-Dev shines

	- Subject identity — every prompt subject was rendered correctly. No "vibrant orange tabby" → cat-shape-blob. The model knows what things look like.
	- Multi-element scenes — samurai + Fuji + cherry blossoms; cyberpunk Alice + neon Cheshire cat + circuit dress + rain. Composition stays coherent.
	- Style adherence — anime ≠ photorealism ≠ oil painting ≠ macro. Got all four right.
	- Light realism — the architecture image's light through stained glass; the food flatlay's morning warmth; the action scene's sunset rim lighting. Light feels real, not stamped on.
	- Text rendering (limited) — "BLOOM CAFE" in neon was readable. Better than most diffusion models; not as clean as a model with explicit OCR pretraining.

	## Where it's weak

	- Patch-grid artifact in flat regions. PATCH_SIZE=32 with no overlap → visible 32×32 grid in skies, water, walls. Most visible in low-frequency content. Architectural: not fixable without retraining or an overlap-blending postprocess.
	- Q4 brightness collapse: Q4 desaturates and darkens everything. Q6 and Q8 do not show it, so ship one of those.
	- Hands: where hands appear (e.g. the tea master holding a cup) they look fine at moderate detail, but the model is probably not immune to the standard diffusion hand failure modes. Not stress-tested yet.
	- Dense long text: "BLOOM CAFE" is short and structured. A paragraph of text would likely fall apart.
	- Speed at 2048: roughly 4 to 4.6 minutes per image (236 s at Q4, 276 s at Q8) is slow for iterative work. Fine for a final pass.

	## Sweet spot

	1024×1024, Q6, default Dev recipe, ~36 s/image, ~8.5 GB RAM. Bright, colourful output equivalent to Q8 at roughly half the wall time and about a quarter less RAM. 512 is fast (~25 s) but loses detail. 2048 is gorgeous but too slow for iteration.

	Quant decision tree:
	- 16 GB Mac → don't run HiDream; use mflux Z-Image-Turbo
	- 32 GB Mac → Q6 is comfortable, Q8 leaves no headroom alongside LTX
	- 64 GB Mac → Q6 default; Q8 only when you want deterministic upper-bound RAM
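The decision tree above can be written as a small lookup. This is a sketch under assumptions: `pick_engine` and its parameter names are hypothetical, and the treatment of memory sizes between the three listed tiers (for example 24 GB or 48 GB, which fall into the lower tier here) is a choice made for the example, not something the evaluation measured.

```python
def pick_engine(total_ram_gb: int, need_fixed_ram_ceiling: bool = False) -> str:
    """Map unified-memory size to an engine/quant choice, following the list above."""
    if total_ram_gb < 32:
        # 16 GB class: HiDream is not recommended at all.
        return "mflux Z-Image-Turbo"
    if total_ram_gb < 64:
        # 32 GB class: Q6 is comfortable; Q8 leaves no headroom alongside LTX.
        return "HiDream Q6"
    # 64 GB class: Q6 by default, Q8 only for a deterministic RAM upper bound.
    return "HiDream Q8" if need_fixed_ram_ceiling else "HiDream Q6"


for ram in (16, 32, 64):
    print(ram, "GB ->", pick_engine(ram))
```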

	## A/B vs mflux Z-Image-Turbo

	Same prompts, same seeds, both at 1024×1024.

	| # | Prompt | HiDream Q8 | Z-Image-Turbo Q4 (mflux) | Subjective winner |
	|---|---|---|---|---|
	| 1 | tea master | [v3](../sample_outputs/showcase/01_portrait_photo.png): wide scene, paper screens, calligraphy | [zimg](../sample_outputs/ab_mflux/01_portrait_zimage.png): tighter portrait, gray garment, smile | Tie: different framings, both excellent |
	| 2 | sunlit beach | [v3](../sample_outputs/v3_1024_beach_q8.png): turquoise water, palm trees, beach chair | [zimg](../sample_outputs/ab_mflux/02_beach_zimage.png): vivid blue water, palms, big sand foreground | Tie: both nail the prompt |
	| 3 | alice cyberpunk | [v3](../sample_outputs/v3_alice_horizontal_q8.png) (horizontal): clear dress + face + Cheshire | [zimg](../sample_outputs/ab_mflux/03_alice_zimage.png): more painterly, atmospheric Cheshire silhouette | HiDream for face/dress detail; Z-Image for atmosphere |

	### Speed + RAM (measured, not estimated)

	| Engine | Steps | Wall (1024) | Per step | Peak RAM |
	|---|---|---|---|---|
	| HiDream-O1-Dev / Q8 | 28 | 67 s | 2.41 s | 11.5 GB |
	| Z-Image-Turbo / Q4 | 9 | 80 s | 8.85 s | 5.9–29.4 GB (varies by prompt) |

	Surprises:
	- HiDream is faster per image despite needing 28 steps vs Z-Image-Turbo's 9 — Z-Image's per-step cost is ~3.7× HiDream's.
	- Z-Image's peak RAM varied wildly across prompts (5.9 GB for portrait, 29.4 GB for the alice cyberpunk). HiDream's peak was steady at ~11.5 GB regardless of prompt complexity.
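The first surprise follows from two ratios in the table: HiDream runs about 3.1× as many steps, but each Z-Image step costs about 3.7× as much, so the per-step gap outweighs the step-count gap. A quick check, using only the measured figures above:

```python
engines = {
    "HiDream-O1-Dev / Q8": {"steps": 28, "per_step_s": 2.41},
    "Z-Image-Turbo / Q4": {"steps": 9, "per_step_s": 8.85},
}

hidream = engines["HiDream-O1-Dev / Q8"]
zimage = engines["Z-Image-Turbo / Q4"]

# Wall time is simply steps times seconds per step.
wall = {name: e["steps"] * e["per_step_s"] for name, e in engines.items()}
per_step_ratio = zimage["per_step_s"] / hidream["per_step_s"]  # ~3.67
step_ratio = hidream["steps"] / zimage["steps"]                # ~3.11

for name, seconds in wall.items():
    print(f"{name}: {seconds:.1f} s")
print(f"per-step ratio {per_step_ratio:.2f} vs step-count ratio {step_ratio:.2f}")
```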

	### Verdict

	Both are excellent local engines. Pick by the workload:

	- Default/compact: keep Z-Image-Turbo — 5.9 GB RAM on most prompts, runs anywhere.
	- Hero shots / max prompt fidelity: HiDream-O1-Q8 — faster wall time, deterministic memory, more environmental detail in the output.
	- Editing / multi-ref: keep mflux qwen-edit — HiDream lab pipeline doesn't support refs yet.

	## Patch-grid post-blend experiment

	Implemented `--blend-seams <radius>` post-process in `generate_hidream_o1_mlx.py`: after decoding the final image, average a thin band across each 32-pixel patch boundary line (radius=1 → blend the seam row with one neighbour on each side, then 50% blend back into the seam itself).
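A minimal NumPy sketch of the post-process as described, reconstructed from the prose rather than taken from `generate_hidream_o1_mlx.py`. The function names, the choice of row index 32 (the first row of the next patch) as "the seam", and applying the same blend to vertical seams are all assumptions.

```python
import numpy as np

PATCH = 32  # patch size reported in this write-up


def _blend_rows(arr: np.ndarray, radius: int, patch: int) -> np.ndarray:
    """Blend each patch-boundary row toward the mean of its neighbouring band."""
    out = arr.copy()
    for seam in range(patch, arr.shape[0], patch):
        lo = max(seam - radius, 0)
        hi = min(seam + radius + 1, arr.shape[0])
        band_mean = arr[lo:hi].mean(axis=0)            # seam row plus `radius` neighbours per side
        out[seam] = 0.5 * arr[seam] + 0.5 * band_mean  # 50% blend back into the seam row
    return out


def blend_seams(img: np.ndarray, radius: int = 1, patch: int = PATCH) -> np.ndarray:
    """Soften horizontal and vertical patch seams of an HxWxC uint8 image."""
    x = img.astype(np.float32)
    x = _blend_rows(x, radius, patch)                                # horizontal seams
    x = _blend_rows(x.swapaxes(0, 1), radius, patch).swapaxes(0, 1)  # vertical seams
    return np.clip(np.rint(x), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Synthetic image with a hard step exactly on the first seam row.
    img = np.full((64, 64, 3), 100, dtype=np.uint8)
    img[32:] = 200
    out = blend_seams(img, radius=1)
    print(out[31, 0, 0], out[32, 0, 0], out[33, 0, 0])  # 100 183 200
```

Only the seam rows and columns are written, which is why a filter like this cannot blur content inside a patch.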

	Result on the same beach prompt + seed 11 + Q8:

	| Comparison | Mean abs diff (out of 255) |
	|---|---|
	| baseline vs blend r=1 | 0.18 |
	| baseline vs blend r=2 | 0.23 |

	Per-row breakdown confirms the blend is surgical: the change is concentrated on seam rows (every 32nd row), which move by 1–2.7 pixel values, while non-seam rows shift by <0.2. So the math is doing exactly what it says.
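A per-row breakdown of this kind can be computed as below. This is a sketch with hypothetical helper names (`per_row_mad`, `seam_report`), and it assumes seam rows sit at multiples of the patch size. One plausible reason non-seam rows move slightly is that vertical seams touch a few pixels in every row, but that is a guess rather than something the measurements establish.

```python
import numpy as np


def per_row_mad(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean absolute difference per image row, on the 0-255 scale."""
    # Widen to int16 first so uint8 subtraction cannot wrap around.
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return diff.reshape(diff.shape[0], -1).mean(axis=1)


def seam_report(a: np.ndarray, b: np.ndarray, patch: int = 32) -> dict:
    """Split the per-row differences into seam rows and all other rows."""
    rows = per_row_mad(a, b)
    is_seam = np.zeros(rows.shape[0], dtype=bool)
    is_seam[patch::patch] = True
    return {
        "overall": float(rows.mean()),
        "seam_rows": float(rows[is_seam].mean()),
        "other_rows": float(rows[~is_seam].mean()),
    }


if __name__ == "__main__":
    base = np.zeros((64, 64, 3), dtype=np.uint8)
    blended = base.copy()
    blended[32] = 2  # pretend only the seam row changed, by 2 pixel values
    print(seam_report(base, blended))
```

Because every row has the same number of pixels, the mean of the per-row values equals the whole-image mean absolute difference reported in the table.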

	But visually: at Q8 the seam artifact is already mild, and the blend's 1–2 pixel-value smoothing is below visual threshold. No visible win, but no harm, and effectively no added latency (numpy vector ops on a 1024×1024 image are sub-ms).

	Bottom line: kept as opt-in flag `--blend-seams 1`. Did not enable by default. The real fix for the patch grid would need overlap-blended patches (architectural change) or a stronger spatial filter (which would visibly blur the image).

	## Software-side speed: nothing left

	Tested `mx.compile` on the forward pass: no measurable improvement (2.366 s/step compiled vs 2.368 s/step uncompiled). The forward is already bandwidth-bound by the 36-layer Q8 decoder's matmul stream, so MLX is already near GPU saturation. Same conclusion for `mx.fast.scaled_dot_product_attention` (already used inside mlx-vlm's Qwen3VLAttention).

	The path to faster inference is architectural, not algorithmic:
	- Fewer steps (would need a smaller distillation; Dev is already the distilled variant)
	- Smaller backbone (would need re-distillation onto a 4B Qwen3-VL — no public version)
	- Caching the text-portion hidden states across denoising steps — possible but invasive (would need to subclass mlx-vlm's Qwen3VLModel; ~2-5% speedup at best since text is <2% of seq length)
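The ceiling on the text-caching idea can be estimated with an Amdahl-style bound. The sketch below assumes per-step cost is roughly proportional to token share, which ignores attention's quadratic term and any cache bookkeeping, so treat it as a back-of-envelope figure. `max_speedup` is a hypothetical helper, not part of mlx-vlm.

```python
def max_speedup(cached_fraction: float) -> float:
    """Amdahl-style bound: speedup when a fraction of per-step work is skipped entirely."""
    if not 0.0 <= cached_fraction < 1.0:
        raise ValueError("cached_fraction must be in [0, 1)")
    return 1.0 / (1.0 - cached_fraction)


for frac in (0.02, 0.05):
    gain = max_speedup(frac) - 1.0
    print(f"skip {frac:.0%} of work -> at most {gain:.1%} faster")
# skip 2% of work -> at most 2.0% faster
# skip 5% of work -> at most 5.3% faster
```

At ~2.4 s/step, a 2% gain is about 0.05 s per step, or roughly 1.3 s over a 28-step image, which is small relative to the cost of subclassing the model.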

	## Verdict

	- Working. Q8 produces real, prompt-faithful, high-quality images at ~67 s/1024.
	- No more easy speedups. The lab's inference loop is already at the floor for this architecture on this hardware.
	- Patch artifacts are real but mild. Low-frequency regions show a 32-pixel grid. Subjects-with-content scenes hide it well.
	- Q4 is not an acceptable quant: it ships dark. Q8 is the safe baseline, and Q6 matched it in the showcase re-run. If we ever want a smaller variant, it would need different bit packing or selective Q6.

	## Recommendation for Phosphene

	Slot it in as a third local engine alongside `mflux Z-Image-Turbo` (compact tier) and `mflux FLUX.2-klein-4B` (comfortable tier). Mark HiDream as comfortable+ (32 GB+) due to the 11.5 GB working set. Don't make it the default: on most prompts it uses about twice the RAM of Z-Image-Turbo (11.5 GB vs 5.9 GB), even though it measured faster per image in the A/B above (67 s vs 80 s). Make it the option for users who want max prompt fidelity and license clarity (MIT, no NC restriction).

	See [PHOSPHENE_INTEGRATION_PLAN.md](PHOSPHENE_INTEGRATION_PLAN.md) for the patch.
