
HiDream-O1-Image-Dev (Q8 MLX): evaluation

Setup: lab branch perf-lab-hidream-o1-mlx, mlx-vlm 0.5.0 + mlx 0.31.2, Mac Studio (64 GB). Recipe: Dev (28 steps, FlashFlowMatch, s_noise=7.5, noise_clip_std=2.5, shift=1.0). All times are wall-clock, with mx.eval called on every step so that MLX's lazy evaluation cannot hide work. All RAM figures are peak resident set size (maximum RSS).
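A minimal sketch of how timings and peak RAM of this kind can be collected. This is not the lab script: `time_steps`, `peak_rss_gb`, and the stand-in workload are hypothetical names for illustration. The sync callable is where `mx.eval` would go in an MLX run; the sketch itself uses only the standard library.

```python
import resource
import sys
import time


def time_steps(step_fn, num_steps, sync_fn=lambda out: out):
    """Return per-step wall-clock times in seconds.

    step_fn(i) runs one step; sync_fn(out) must block until the result is
    really computed. With a lazy framework such as MLX, sync_fn would
    typically be mx.eval (assumption: not exercised in this sketch).
    """
    times = []
    for i in range(num_steps):
        start = time.perf_counter()
        out = step_fn(i)
        sync_fn(out)  # force evaluation inside the timed region
        times.append(time.perf_counter() - start)
    return times


def peak_rss_gb():
    """Peak resident set size of this process, in GB (10^9 bytes)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return rss * scale / 1e9


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model's denoising step.
    per_step = time_steps(lambda i: sum(range(100_000)), num_steps=28)
    print(f"mean {sum(per_step) / len(per_step):.4f} s/step, "
          f"total {sum(per_step):.2f} s, peak RSS {peak_rss_gb():.2f} GB")
```

Timing each step separately, rather than the whole loop, makes it easy to see whether the first step carries warm-up cost.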

Q6 showcase verification (2026-05-09 evening)

Re-ran the same 10-prompt battery at Q6 with identical seeds. All 10 are visually equivalent or better than the Q8 versions:

  • 9/10 are aesthetically near-identical: quantization differences perturb the latents slightly, but composition, lighting, and subjects match
  • Prompt 10 (text rendering) is visibly better at Q6: the "BLOOM CAFE" neon sign is crisp at Q6 vs a glitched "M" at Q8

Per-image timing was rock-steady at 35.9 s (1.28 s/step). Total battery time: ~6 minutes vs ~12 minutes at Q8.

Outputs: sample_outputs/showcase_q6/ (compare against sample_outputs/showcase/ for the Q8 originals).

Battery: 10 prompts, 1024×1024, all Q8

| # | Genre | Prompt summary | Result | Time |
|---|---|---|---|---|
| 01 | photo portrait | elderly Japanese tea master | Excellent: face character, gentle smile, paper screens, calligraphy | 81.5 s* |
| 02 | anime / illustration | pink-haired girl on Tokyo rooftop at dusk | Excellent: anime style + cherry blossoms + neon city below | 65.3 s |
| 03 | macro photo | dewdrop on spiderweb | Excellent: refractions, blurred leaf bg, crisp web detail | 65.9 s |
| 04 | architecture | futuristic library, holographic displays | Excellent: vaulted ceiling, stained glass, holo screens | 66.3 s |
| 05 | surreal painting | whale floating over desert at sunset | Excellent: magical realism, painterly clouds | 65.8 s |
| 06 | food flatlay | rustic Italian breakfast on marble | Excellent: golden croissants, espresso, berries, soft light | 66.4 s |
| 07 | cinematic action | samurai mid-leap with katana, Mt. Fuji bg | Excellent: dynamic pose, cherry blossoms, real mountain | 66.1 s |
| 08 | fantasy | dragon on crystal mountain with aurora | Excellent: iridescent scales, snow swirling, aurora visible | 66.4 s |
| 09 | wildlife photo | snow leopard staring at camera | Excellent: direct gaze, falling snow, mountain bg | 67.1 s |
| 10 | text rendering | "BLOOM CAFE" pink neon diner | Good: sign legible (small "M" glitch), retro diner, rainy street | 67.1 s |

*Image 01 included cold model load (~12-15 s).

Steady-state per-image: 65-67 s at 1024×1024 Q8, consistent across genres.
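The steady-state figure and the cold-load overhead can be rechecked from the battery times. A small sketch, with the per-image times copied from the table above and image 01 treated as the only cold run:

```python
from statistics import mean

# Per-image wall times in seconds, images 01-10, from the battery table.
times = [81.5, 65.3, 65.9, 66.3, 65.8, 66.4, 66.1, 66.4, 67.1, 67.1]

warm = times[1:]                      # image 01 includes the cold model load
steady = mean(warm)                   # steady-state seconds per image
cold_overhead = times[0] - steady     # extra time attributable to the cold load
per_step = steady / 28                # Dev recipe uses 28 steps

print(f"steady-state mean: {steady:.1f} s (range {min(warm)}-{max(warm)} s)")
print(f"implied cold-load overhead on image 01: {cold_overhead:.1f} s")
print(f"per step: {per_step:.2f} s")
```

The warm mean comes out at about 66.3 s (2.37 s/step), and the implied cold-load overhead at about 15 s.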

Honest timings

| Resolution | Quant | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|---|
| 512×512 | Q4 | 0.89 s | 24.9 s | ~6 GB |
| 1024×1024 | Q4 | 2.37 s | 66 s | ~6 GB |
| 1024×1024 | Q6 | 1.30 s | 36 s | ~8.5 GB |
| 1024×1024 | Q8 | 2.36 s | 66 s | ~11.5 GB |
| 1280×704 | Q8 | 2.53 s | 70.7 s | ~7 GB |
| 704×1280 | Q8 | 2.35 s | 65.9 s | ~3 GB (warm cache) |
| 2048×2048 | Q4 | 8.44 s | 236 s | ~7.2 GB |
| 2048×2048 | Q8 | 9.86 s | 276 s | ~10.8 GB |

Q6 is the sweet spot. It is about 1.8× faster than Q8 at 1024 (36 s vs 66 s) with the same prompt fidelity (the cat-in-sunlit-kitchen and beach-with-palm-trees prompts both rendered identically to the Q8 outputs), and it uses about 26% less RAM (8.5 GB vs 11.5 GB). The Q6 result fits the bandwidth-bound theory: fewer bits per param → less weight bandwidth → faster per step.
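The Q6-vs-Q8 comparison can be recomputed from the timing table. A small sketch using only the two 1024×1024 rows; the dictionary layout is just for illustration:

```python
# Values copied from the timing table (1024x1024, 28 steps).
q6 = {"total_s": 36.0, "ram_gb": 8.5}
q8 = {"total_s": 66.0, "ram_gb": 11.5}

speedup = q8["total_s"] / q6["total_s"]        # how many times faster Q6 is
ram_saving = 1 - q6["ram_gb"] / q8["ram_gb"]   # fraction of peak RAM saved

print(f"Q6 speedup over Q8: {speedup:.2f}x")
print(f"Q6 peak-RAM saving: {ram_saving:.0%}")
```

Unrounded, the ratios are roughly 1.83× and 26%.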

Q4 corrupts brightness (ships dark), so the speed of Q4 vs Q6 is academic: never use Q4 for production. Q6 wins on speed; Q8 wins on a deterministic upper bound on RAM.

Where HiDream-O1-Image-Dev shines

  • Subject identity: every prompt subject was rendered correctly. No "vibrant orange tabby" → cat-shape-blob. The model knows what things look like.
  • Multi-element scenes: samurai + Fuji + cherry blossoms; cyberpunk Alice + neon Cheshire cat + circuit dress + rain. Composition stays coherent.
  • Style adherence: anime ≠ photorealism ≠ oil painting ≠ macro. Got all four right.
  • Light realism: the architecture image's light through stained glass; the food flatlay's morning warmth; the action scene's sunset rim lighting. Light feels real, not stamped on.
  • Text rendering (limited): "BLOOM CAFE" in neon was readable. Better than most diffusion models; not as clean as a model with explicit OCR pretraining.

Where it's weak

  • Patch-grid artifact in flat regions. PATCH_SIZE=32 with no overlap → visible 32×32 grid in skies, water, walls. Most visible in low-frequency content. The cause is architectural: not fixable without retraining or an overlap-blending postprocess.
  • Q4 brightness collapse: Q4 desaturates and darkens everything. Q6 and Q8 do not show it. Do not ship Q4.
  • Hands: where present (e.g. the tea master holding a cup) they look fine at moderate detail, but the model isn't immune to the standard diffusion hand failure modes; not stress-tested.
  • Dense long text: "BLOOM CAFE" is short and structured. A paragraph of text would likely fall apart.
  • Speed at 2048: roughly 4 to 4.6 minutes per image (236-276 s) is slow for iterative work. Fine for a final pass.

Sweet spot

1024×1024, Q6, default Dev recipe, ~36 s/image, ~8.5 GB RAM. Bright/colourful output equivalent to Q8, at roughly half the wall time and about 26% less RAM. 512 is fast (~25 s) but loses detail. 2048 is gorgeous but too slow for iteration.

Quant decision tree:

  • 16 GB Mac → don't run HiDream; use mflux Z-Image-Turbo
  • 32 GB Mac → Q6 is comfortable, Q8 leaves no headroom alongside LTX
  • 64 GB Mac → Q6 default; Q8 only when you want deterministic upper-bound RAM
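The decision tree can be written down as a small function. This is a sketch only: `pick_engine` and its flag are hypothetical names, and treating every machine below 32 GB like the 16 GB case is an assumption, since the tree only lists the 16, 32, and 64 GB tiers.

```python
def pick_engine(ram_gb: float, need_fixed_ram_ceiling: bool = False) -> str:
    """Map unified-memory size to an engine/quant choice.

    The 32 GB and 64 GB cut-offs follow the decision tree; the handling of
    sizes between the listed tiers is an assumption of this sketch.
    """
    if ram_gb < 32:
        # Too little headroom for HiDream: fall back to the compact engine.
        return "mflux Z-Image-Turbo"
    if ram_gb >= 64 and need_fixed_ram_ceiling:
        # Q8 only when a deterministic upper bound on RAM matters.
        return "HiDream-O1-Image-Dev Q8"
    # Q6 is the default wherever HiDream fits.
    return "HiDream-O1-Image-Dev Q6"


for ram in (16, 32, 64):
    print(ram, "GB ->", pick_engine(ram))
```

On a 32 GB machine the flag is deliberately ignored, because Q8 leaves no headroom alongside LTX there.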

A/B vs mflux Z-Image-Turbo

Same prompts, same seeds, both at 1024×1024.

| # | Prompt | HiDream Q8 | Z-Image-Turbo Q4 (mflux) | Subjective winner |
|---|---|---|---|---|
| 1 | tea master | v3: wide scene, paper screens, calligraphy | zimg: tighter portrait, gray garment, smile | Tie: different framings, both excellent |
| 2 | sunlit beach | v3: turquoise water, palm trees, beach chair | zimg: vivid blue water, palms, big sand foreground | Tie: both nail the prompt |
| 3 | alice cyberpunk | v3 (horizontal): clear dress + face + Cheshire | zimg: more painterly, atmospheric Cheshire silhouette | HiDream for face/dress detail; Z-Image for atmosphere |

Speed + RAM (measured, not estimated)

| Engine | Steps | Wall (1024) | Per step | Peak RAM |
|---|---|---|---|---|
| HiDream-O1-Dev / Q8 | 28 | 67 s | 2.41 s | 11.5 GB |
| Z-Image-Turbo / Q4 | 9 | 80 s | 8.85 s | 5.9–29.4 GB (varies by prompt) |

Surprises:

  • HiDream is faster per image despite needing 28 steps vs Z-Image-Turbo's 9: Z-Image's per-step cost is ~3.7× HiDream's.
  • Z-Image's peak RAM varied wildly across prompts (5.9 GB for portrait, 29.4 GB for the alice cyberpunk). HiDream's peak was steady at ~11.5 GB regardless of prompt complexity.
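The first surprise is plain arithmetic on the speed table: wall time is steps times per-step cost. A small sketch with the measured values copied in; the dictionary layout is only for illustration:

```python
# Measured values from the speed table (1024x1024).
engines = {
    "HiDream-O1-Dev / Q8": {"steps": 28, "per_step_s": 2.41},
    "Z-Image-Turbo / Q4": {"steps": 9, "per_step_s": 8.85},
}

# Wall time per image = number of steps x cost per step.
wall = {name: e["steps"] * e["per_step_s"] for name, e in engines.items()}
ratio = (engines["Z-Image-Turbo / Q4"]["per_step_s"]
         / engines["HiDream-O1-Dev / Q8"]["per_step_s"])

for name, seconds in wall.items():
    print(f"{name}: {seconds:.1f} s per image")
print(f"per-step cost ratio: {ratio:.2f}x")
```

This gives roughly 67 s and 80 s per image. Z-Image-Turbo takes about a third as many steps, but each step costs about 3.7 times as much.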

Verdict

Both are excellent local engines. Pick by the workload:

  • Default/compact: keep Z-Image-Turbo. 5.9 GB RAM on most prompts, runs anywhere.
  • Hero shots / max prompt fidelity: HiDream-O1-Q8. Faster wall time, deterministic memory, more environmental detail in the output.
  • Editing / multi-ref: keep mflux qwen-edit. The HiDream lab pipeline doesn't support refs yet.

Patch-grid post-blend experiment

Implemented --blend-seams <radius> post-process in generate_hidream_o1_mlx.py: after decoding the final image, average a thin band across each 32-pixel patch boundary line (radius=1 → blend the seam row with one neighbour on each side, then 50% blend back into the seam itself).
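A minimal NumPy sketch of a seam blend along these lines. It is not the code in generate_hidream_o1_mlx.py: the function name, the choice of index k × 32 as "the seam" row/column, and the float32 round-trip are all assumptions made for illustration.

```python
import numpy as np

PATCH_SIZE = 32  # patch edge length stated in the write-up


def blend_seams(img: np.ndarray, radius: int = 1, patch: int = PATCH_SIZE) -> np.ndarray:
    """Soften patch-boundary rows and columns of an H x W (x C) uint8 image.

    Assumption: the seam is the first row/column of each patch (index k * patch).
    """
    out = img.astype(np.float32)
    for axis in (0, 1):  # rows first, then columns
        src = out.copy()  # read from a snapshot so seams do not feed each other
        size = src.shape[axis]
        for s in range(patch, size, patch):
            lo, hi = max(s - radius, 0), min(s + radius + 1, size)
            # Mean of the seam line and its neighbours within `radius`.
            band = np.take(src, np.arange(lo, hi), axis=axis).mean(axis=axis)
            seam = np.take(src, s, axis=axis)
            blended = 0.5 * seam + 0.5 * band  # 50% blend back into the seam
            if axis == 0:
                out[s] = blended
            else:
                out[:, s] = blended
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    demo = np.zeros((64, 64, 3), dtype=np.uint8)
    demo[32:] = 90  # hard step exactly on a patch boundary
    print(blend_seams(demo)[30:35, 0, 0])
```

Only the seam lines are written, which is why an operation like this leaves the rest of the image essentially untouched.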

Result on the same beach prompt + seed 11 + Q8:

| Comparison | Mean abs diff (out of 255) |
|---|---|
| baseline vs blend r=1 | 0.18 |
| baseline vs blend r=2 | 0.23 |

Per-row breakdown confirms the blend is surgical: only seam rows (every 32) change, by 1–2.7 pixel values; non-seam rows shift by <0.2. So the math is doing exactly what it says.
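The two measurements above (overall mean absolute difference and the per-row breakdown) can be sketched as follows. The helper names are hypothetical, and seam rows are assumed to sit at multiples of 32:

```python
import numpy as np


def mean_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute per-channel difference between two uint8 images."""
    return float(np.abs(a.astype(np.int16) - b.astype(np.int16)).mean())


def per_row_diff(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean absolute difference for each image row."""
    d = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return d.reshape(d.shape[0], -1).mean(axis=1)


def seam_vs_rest(a: np.ndarray, b: np.ndarray, patch: int = 32):
    """Average per-row difference on seam rows vs all other rows."""
    rows = per_row_diff(a, b)
    is_seam = np.arange(rows.size) % patch == 0
    is_seam[0] = False  # row 0 is the image edge, not a patch seam
    return float(rows[is_seam].mean()), float(rows[~is_seam].mean())


if __name__ == "__main__":
    base = np.zeros((64, 64, 3), dtype=np.uint8)
    changed = base.copy()
    changed[32] = 4  # pretend only the seam row moved
    print(mean_abs_diff(base, changed), seam_vs_rest(base, changed))
```

Casting to a signed type before subtracting avoids uint8 wrap-around, which would otherwise inflate the differences.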

But visually: at Q8 the seam artifact is already mild. The blend's 1–2 pixel-value smoothing is below visual threshold. No win, but no harm, and negligible added latency (numpy vector ops on a 1024×1024 image are sub-ms).

Bottom line: kept as opt-in flag --blend-seams 1. Did not enable by default. The real fix for the patch grid would need overlap-blended patches (architectural change) or a stronger spatial filter (which would visibly blur the image).

Software-side speed: nothing left

Tested mx.compile on the forward pass: 0% improvement (2.366 s/step compiled vs 2.368 s/step uncompiled). The forward is already bandwidth-bound by the 36-layer Q8 decoder's matmul stream, so MLX is already near GPU saturation. Same conclusion for mx.fast.scaled_dot_product_attention (already used inside mlx-vlm's Qwen3VLAttention).

The path to faster is architectural, not algorithmic:

  • Fewer steps (would need a smaller distillation; Dev is already the distilled variant)
  • Smaller backbone (would need re-distillation onto a 4B Qwen3-VL; no public version)
  • Caching the text-portion hidden states across denoising steps: possible but invasive (would need to subclass mlx-vlm's Qwen3VLModel; ~2-5% speedup at best since text is <2% of seq length)

Verdict

  • Working. Q8 produces real, prompt-faithful, high-quality images at ~67 s/1024.
  • No more easy speedups. The lab's inference loop is already at the floor for this architecture on this hardware.
  • Patch artifacts are real but mild. Low-frequency regions show a 32-pixel grid. Subjects-with-content scenes hide it well.
  • Q4 is not an acceptable quant: it ships dark. Q8 holds up, and the later Q6 showcase verification shows Q6 holds up as well. A variant smaller than Q6 would need different bit packing or selective Q6.

Recommendation for Phosphene

Slot it in as a third local engine alongside mflux Z-Image-Turbo (compact tier) and mflux FLUX.2-klein-4B (comfortable tier). Mark HiDream as comfortable+ (32 GB+) due to the 11.5 GB working set. Don't make it the default: it uses roughly twice the RAM of Z-Image-Turbo on most prompts (11.5 GB vs 5.9 GB), even though it measured faster per image (67 s vs 80 s). Make it the option for users who want max prompt fidelity and license clarity (MIT, no NC restriction).

See PHOSPHENE_INTEGRATION_PLAN.md for the patch.