HiDream-O1-Image / README.md
TheViper's picture
Update README.md
f8182df verified
metadata
license: mit
pipeline_tag: image-text-to-image
library_name: transformers

HiDream-O1-Image

HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space β€” supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 Γ— 2,048.

Project Updates

  • πŸ› οΈ May 13, 2026: Inference & pipeline updates β€” accelerated IP inference; the IP pipeline now supports layout and skeleton conditioning; updated the Dev editing scheduler. For editing tasks we recommend using the full model. PyTorch 2.9.x is not recommended due to the issue
  • πŸ€— May 10, 2026: Try HiDream-O1-Image online on Hugging Face Spaces β€” πŸ€— HiDream-O1-Image and πŸ€— HiDream-O1-Image-Dev.
  • πŸ“• May 10, 2026: Our technical report is now available β€” πŸ“‘ HiDream-O1-Image.pdf.
  • πŸš€ May 8, 2026: We've open-sourced HiDream-O1-Image (8B), including both the undistilled and distilled Dev variants, together with the Reasoning-Driven Prompt Agent.

HiDream-O1-Image (codename: Peanut) debuts at #8 in the Artificial Analysis Text to Image Arena, which is positioned to be the new leading open weights Text to Image model (2026-5-5).

Artificial Analysis Text to Image Arena
Artificial Analysis Text to Image Arena at up to 2,048 Γ— 2,048.

General text-to-image generation
General text-to-image generation at up to 2,048 Γ— 2,048.

Long-text rendering and layout
Long-text rendering & layout control β€” accurate, multi-region, multilingual text.

Subject-driven personalization
Subject-driven personalization β€” preserve identity / IP across new scenes.

Key Features

  • 🧬 Pixel-Level Unified Transformer β€” One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
  • 🎨 One Model, Many Tasks β€” Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture.
  • 🧠 Reasoning-Driven Prompt Agent β€” Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation.
  • πŸ–ΌοΈ Native High Resolution β€” Direct synthesis up to 2,048 Γ— 2,048 with sharp fine-grained detail.
  • ⚑ Exceptional Efficiency and Versatility at 8B Scale β€” With only 8B parameters, achieves performance parity with or even surpasses larger open-source DiTs and leading closed-source models.

Models

Name Script Inference Steps HuggingFace Repo
HiDream-O1-Image inference.py 50 πŸ€— HiDream-O1-Image
HiDream-O1-Image-Dev inference.py 28 πŸ€— HiDream-O1-Image-Dev
Prompt Agent prompt_agent.py β€” πŸ€— google/gemma-4-31B-it
Web Demo app.py β€” β€”

Evaluation

We benchmark HiDream-O1-Image against state-of-the-art open-source and proprietary models on five widely-used evaluation suites covering compositional generation, dense prompt alignment, human preference, complex visual text generation, and long-text rendering. In each table, the best result is highlighted in bold and the second-best is underlined. Click any benchmark below to expand or collapse.

GenEval β€” compositional generation
Model #Params Single-Obj Two-Obj Count Color Position Attr Overall
Nano Banana 2.0 – 1.00 0.96 0.71 0.84 0.86 0.65 0.83
Seedream-4.0 – 1.00 0.92 0.71 0.93 0.78 0.68 0.84
GPT Image 1 [High] – 0.99 0.92 0.85 0.92 0.75 0.61 0.84
GPT Image 2 – 0.99 0.98 0.85 0.93 0.85 0.77 0.89
PixArt 4.3B + 0.6B 0.98 0.50 0.44 0.80 0.08 0.07 0.48
Show-o 1.3B 0.95 0.52 0.49 0.82 0.11 0.28 0.53
Emu3-Gen 8B 0.98 0.71 0.34 0.81 0.17 0.21 0.54
SD3-Medium 5.5B + 2B 0.98 0.74 0.63 0.67 0.34 0.36 0.62
JanusFlow 1.3B 0.97 0.59 0.45 0.83 0.53 0.42 0.63
FLUX.1 [Dev] 4.8B + 12B 0.98 0.81 0.74 0.79 0.22 0.45 0.66
SD3.5 Large 5.5B + 8.1B 0.98 0.89 0.73 0.83 0.34 0.47 0.71
Janus-Pro-7B 7B 0.99 0.89 0.59 0.90 0.79 0.66 0.80
Z-Image-Turbo 4B + 6B 1.00 0.95 0.77 0.89 0.65 0.68 0.82
FLUX.2 [Dev] 24B + 32B 1.00 0.99 0.79 0.93 0.73 0.78 0.87
Qwen-Image 7B + 20B 0.99 0.92 0.89 0.88 0.76 0.77 0.87
HiDream-O1-Image 8B 1.00 0.99 0.79 0.89 0.93 0.78 0.90
HiDream-O1-Image-Pro 200B+ 1.00 0.99 0.85 0.94 0.94 0.79 0.92
DPG-Bench β€” dense prompt alignment
Model #Params Global Entity Attribute Relation Other Overall
GPT Image 1 [High] – 88.89 88.94 89.84 92.63 90.96 85.15
GPT Image 2 – 87.27 91.91 90.85 91.59 91.58 85.98
Nano Banana 2.0 – 85.17 92.55 91.16 90.45 91.08 86.90
Seedream-4.0 – 87.17 92.41 92.29 93.33 95.48 88.63
SD v1.5 0.12B + 0.86B 74.63 74.23 75.39 73.49 67.81 63.18
PixArt 4.3B + 0.6B 74.97 79.32 78.60 82.57 76.96 71.11
Lumina-Next 2B + 2B 82.82 88.65 86.44 80.53 81.82 74.63
SDXL 0.81B + 2.6B 83.27 82.43 80.91 86.76 80.41 74.65
Hunyuan-DiT 4.8B + 1.5B 84.59 80.59 88.01 74.36 86.41 78.87
Emu3-Gen 8B 85.21 86.68 86.84 90.22 83.15 80.60
DALL-E 3 – 90.97 89.61 88.39 90.58 89.83 83.50
FLUX.1 [Dev] 4.8B + 12B 74.35 90.00 88.96 90.87 88.33 83.84
SD3 Medium 5.5B + 2B 87.90 91.01 88.83 80.70 88.68 84.08
Janus-Pro-7B 7B 86.90 88.90 89.40 89.32 89.48 84.19
Z-Image-Turbo 4B + 6B 91.29 89.59 90.14 92.16 88.68 84.86
HiDream-I1-Full 13.5B + 17B 76.44 90.22 89.48 93.74 91.83 85.89
FLUX.2 [Dev] 24B + 32B 92.20 91.36 93.28 93.52 89.72 87.57
Qwen-Image 7B + 20B 91.32 91.56 92.02 94.31 92.73 88.32
HiDream-O1-Image 8B 95.15 92.32 93.74 92.88 90.25 89.83
HiDream-O1-Image-Pro 200B+ 94.97 95.42 92.59 90.82 89.50 90.30
HPSv3 β€” human preference across 12 categories
Model #Params All Characters Arts Design Architecture Animals Natural Scenery Transportation Products Plants Food Science Others
Seedream-4.0 – 9.32 9.83 9.20 8.83 9.95 8.99 9.40 9.58 9.12 9.26 9.75 9.11 9.51
Nano Banana 2.0 – 10.01 10.18 9.18 9.58 10.96 9.71 10.04 10.38 10.36 10.14 10.61 9.14 9.89
GPT Image 2 – 10.21 10.75 9.91 10.15 10.59 10.05 10.29 10.17 10.26 10.07 10.75 10.05 10.00
Z-Image-Turbo 4B + 6B 8.35 8.98 8.29 7.65 9.26 8.51 8.33 8.81 7.83 8.46 8.64 7.93 8.57
FLUX.2 [Dev] 24B + 32B 9.28 10.23 9.56 8.80 9.73 9.43 9.21 9.44 8.93 9.23 9.82 8.67 9.11
Qwen-Image 7B + 20B 9.94 10.91 10.47 9.56 10.22 10.61 9.87 10.10 9.15 9.99 10.08 9.19 9.83
HiDream-O1-Image 8B 10.37 10.59 10.44 10.29 11.02 10.34 10.37 10.54 10.50 10.38 10.85 9.68 10.09
HiDream-O1-Image-Pro 200B+ 10.47 10.63 10.51 10.33 11.11 10.08 10.45 10.37 10.75 10.29 11.13 10.09 10.39
CVTG-2K β€” complex visual text generation (click to expand)
Model #Params 2 regions 3 regions 4 regions 5 regions Average NED CLIP Score
Nano Banana 2.0 – 0.7465 0.7720 0.8067 0.7980 0.7875 0.8945 0.7212
GPT Image 1 [High] – 0.8779 0.8659 0.8731 0.8218 0.8569 0.9478 0.7982
Seedream-4.0 – 0.8980 0.8949 0.9044 0.9015 0.9003 0.9511 0.8033
GPT Image 2 – 0.8904 0.8887 0.9101 0.9044 0.9003 0.9515 0.7798
TextDiffuser-2 0.12B + 0.9B 0.5322 0.3255 0.1787 0.0809 0.2326 0.4353 0.6765
RAG-Diffusion 4.8B + 12B 0.4388 0.3316 0.2116 0.1910 0.2648 0.4498 0.7797
AnyText 0.123B + 1.2B 0.0513 0.1739 0.1948 0.2249 0.1804 0.4675 0.7432
3DIS 0.81B + 2.6B 0.4495 0.3959 0.3880 0.3303 0.3813 0.6505 0.7767
FLUX.1 [Dev] 4.8B + 12B 0.6089 0.5531 0.4661 0.4316 0.4965 0.6879 0.7401
SD3.5 Large 5.5B + 8.1B 0.7293 0.6825 0.6574 0.5940 0.6548 0.8470 0.7797
TextCrafter 7B + 20B 0.7628 0.7628 0.7406 0.6977 0.7370 0.8679 0.7868
Qwen-Image 7B + 20B 0.8370 0.8364 0.8313 0.8158 0.8288 0.9116 0.8017
Z-Image-Turbo 4B + 6B 0.8872 0.8662 0.8628 0.8347 0.8585 0.9281 0.8048
FLUX.2 [Dev] 24B + 32B 0.9261 0.8897 0.8995 0.8732 0.8926 0.9475 0.8104
HiDream-O1-Image 8B 0.9085 0.9159 0.9216 0.9015 0.9128 0.9561 0.8076
HiDream-O1-Image-Pro 200B+ 0.9133 0.9221 0.9365 0.9175 0.9222 0.9628 0.8349
LongText-Bench β€” long-text rendering, EN & ZH (click to expand)
Model #Params LongText-Bench-EN LongText-Bench-ZH
Seedream-4.0 – 0.936 0.946
GPT Image 1 [High] – 0.956 0.619
GPT Image 2 – 0.960 0.961
Nano Banana 2.0 – 0.980 0.965
Janus-Pro-7B 7B 0.019 0.006
BLIP3-o 7B + 1.4B 0.021 0.018
Kolors 2.0 – 0.258 0.329
BAGEL 7B + 7B 0.373 0.310
OmniGen2 3B + 4B 0.561 0.059
X-Omni 7B 0.900 0.814
HiDream-I1-Full 13.5B + 17B 0.543 0.024
FLUX.1 [Dev] 4.8B + 12B 0.607 0.005
Z-Image-Turbo 4B + 6B 0.917 0.926
FLUX.2 [Dev] 24B + 32B 0.963 0.757
Qwen-Image 7B + 20B 0.943 0.946
HiDream-O1-Image 8B 0.979 0.978
HiDream-O1-Image-Pro 200B+ 0.982 0.980

Installation

  1. Clone this repository:
git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image
  1. Install the required dependencies:
pip install -r requirements.txt

Note on flash-attn. We highly recommend installing flash-attn for optimized attention computation. If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 291 and change "use_flash_attn": True to "use_flash_attn": False β€” otherwise inference will fail to import the kernel.

Reasoning-Driven Prompt Agent

HiDream-O1-Image ships with a Reasoning-Driven Prompt Agent (prompt_agent.py) that explicitly reasons through layout, subject attributes, physical logic, and text-rendering details, then rewrites a raw user instruction into a self-contained English prompt. It supports two backends β€” pick one with --backend.

The agent prints a JSON object with three fields: prompt (rewritten English prompt), reasoning, and resolved_knowledge. Feed the prompt field into inference.py for best results on intricate, reasoning-heavy requests.

Option A β€” Local Backend (Gemma-4-31B-it)

  1. Download the Gemma weights (requires accepting the Gemma license on HuggingFace):
huggingface-cli download google/gemma-4-31B-it --local-dir /path/to/gemma-4-31B-it
  1. Run the refiner locally:
python prompt_agent.py \
    --backend local \
    --model_id /path/to/gemma-4-31B-it \
    --prompt "ζŽη™½ηš„ι™ε€œζ€ε†™εœ¨ε€ε’™δΈŠ"

Option B β€” External OpenAI-Compatible API

Use any OpenAI-compatible endpoint (OpenAI, Azure, vLLM, SGLang, DeepSeek, etc.) by providing --base_url, --api_key, and --model_name:

python prompt_agent.py \
    --backend api \
    --base_url https://api.openai.com/v1 \
    --api_key $OPENAI_API_KEY \
    --model_name deepseek-v4-pro \
    --prompt "ζŽη™½ηš„ι™ε€œζ€ε†™εœ¨ε€ε’™δΈŠ"

Usage

A CUDA-capable GPU is required for inference. The examples below use the undistilled model (--model_type full); see the last subsection for running the same tasks with the distilled model (--model_type dev).

1. Text-to-Image Generation

Generate an image from a text prompt:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room." \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

2. Instruction-Based Image Editing

Provide a single reference image and an editing instruction:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect

3. Multi-Reference Subject-Driven Personalization

Provide two or more reference images that define the subject(s), and a prompt that places them in a new scene:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --shift 1 \
    --prompt "A young boy with blonde hair stands on steps wearing light blue jeans, a white t-shirt with logo, and blue and white sneakers. He wears a brown cord necklace with beads, a black wristwatch with digital display, and carries a yellow fanny pack with white zipper. In his hand is a red boxing glove with white top, a teal plastic toy car, and a plastic toy figure of Captain America. He wears a straw hat with cream band. Natural light illuminates the scene." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg assets/IP/4.jpg assets/IP/5.jpg assets/IP/6.jpg assets/IP/7.jpg assets/IP/8.jpg assets/IP/9.jpg assets/IP/10.jpg \
    --output_image results/subject.png

4. Multi-Reference Subject-Driven Personalization with Skeleton

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --shift 1 \
    --seed 42 \
    --prompt "Create a realistic try-on image of the person wearing the provided clothing." \
    --ref_images assets/IP_skeleton/0.face.jpg assets/IP_skeleton/0.bg.jpg assets/IP_skeleton/0.openpose.jpg assets/IP_skeleton/0.part_1.jpg assets/IP_skeleton/0.part_2.jpg assets/IP_skeleton/0.part_3.jpg  \
    --output_image results/subject.png

5. Multi-Reference Subject-Driven Personalization with Layout

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --shift 1 \
    --seed 42 \
    --prompt "City council members pose with relaxed smiles on a sunlit terrace, warm approachable mood, golden hour, cinematic soft glow." \
    --ref_images assets/IP_layout/0.jpg assets/IP_layout/1.jpg \
    --layout_bboxes "[[0.20507812, 0.43945312, 0.48828125, 0.7421875 ], [0.57617188, 0.80078125, 0.08789062, 0.34179688]]" \
    --output_image results/ip_layout.png

6. Running with the Dev Model

All three tasks above can be run with the Dev model by switching --model_path to the Dev checkpoint and setting --model_type dev. For example:

python inference.py \
    --model_path /path/to/HiDream-O1-Image-Dev \
    --prompt "A dog holds a sign that says \"HiDream-O1-Image release.\"" \
    --output_image results/t2i_dev.png \
    --model_type dev

Command Line Arguments

  • --model_path: Path to the complete HuggingFace model directory (undistilled or distilled).
  • --prompt: Text prompt for the generation or editing task.
  • --ref_images: Paths to one or more reference images (optional; space-separated).
  • --output_image: Path to save the generated image (default: output.png).
  • --height / --width: Output image dimensions (default: 2048 Γ— 2048; values snap to valid resolutions internally).
  • --model_type: full or dev (default: full). Selects the inference recipe:
    • full: 50 steps, guidance scale 5.0, shift 3.0, default scheduler.
    • dev: 28 steps, guidance scale 0.0, shift 1.0, flash scheduler with predefined timesteps.
  • --seed: Random seed (default: 32).
  • --guidance_scale: Guidance scale (default: 5.0). Only effective when --model_type is full.
  • --noise_scale_start, --noise_scale_end: Control the scale of the noise injected by the scheduler at each denoising step; the per-step scale linearly interpolates from noise_scale_start (first step) to noise_scale_end (last step). See models/pipeline.py:262 and models/pipeline.py:273. Defaults: 7.5, 7.5.
  • --noise_clip_std: Per-step clipping threshold (in units of the injected noise's standard deviation) applied to the noise added during scheduler stepping. See models/flash_scheduler.py:348-350. Default: 2.5.
  • --keep_original_aspect: When exactly one reference image is provided, resize it with max_size=2048 and use its dimensions for the target image (preserves the reference's aspect ratio) if True.

Web Demo

app.py is a self-contained Flask web application that exposes all generation modes. It also integrates the Reasoning-Driven Prompt Agent.

Starting the server

python app.py \
    --model_path /path/to/HiDream-O1-Image \
    --host 0.0.0.0 \
    --port 7860

Then open http://localhost:7860 in your browser.

Command-line arguments

Argument Default Description
--model_path $HIDREAM_MODEL_PATH Path to the checkpoint directory (HiDream-O1-Image or HiDream-O1-Image-Dev).
--model_type full full (50-step) or dev (28-step).
--host 0.0.0.0 Bind address for the Flask server.
--port 7860 Port for the Flask server.

All four arguments can also be set via environment variables (see .env.example): HIDREAM_MODEL_PATH, HIDREAM_MODEL_TYPE, HIDREAM_HOST, and HIDREAM_PORT.

Prompt Agent in the UI

The sidebar contains a Prompt Agent panel that calls the same Reasoning-Driven Prompt Agent used by prompt_agent.py. Select either the OpenAI-compatible API backend (any endpoint, key, and model name) or the Local Β· Gemma backend (set HIDREAM_AGENT_MODEL in .env or the environment to point to your local Gemma-4-31B-it weights).

License

The code in this repository and the HiDream-O1-Image models are licensed under MIT License.