HiDream-ai/HiDream-O1-Image-Dev

@@ -3,19 +3,22 @@ license: mit
 pipeline_tag: image-text-to-image
 library_name: transformers
 ---
 # HiDream-O1-Image
 HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.
 ## Project Updates
-- 🛠️ **May 13, 2026:** Inference & pipeline updates — accelerated IP inference; the IP pipeline now supports **layout** and **skeleton** conditioning; updated the Dev editing scheduler. For editing tasks we recommend using the **full** model. PyTorch 2.9.x is not recommended due to the [issue](https://github.com/QwenLM/Qwen3-VL/issues/1811)
 - 🤗 **May 10, 2026:** Try **HiDream-O1-Image** online on Hugging Face Spaces — [🤗 HiDream-O1-Image](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image) and [🤗 HiDream-O1-Image-Dev](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image-Dev).
-- 📕 **May 10, 2026:** Our **technical report** is now available — [📑 HiDream-O1-Image.pdf](https://github.com/HiDream-ai/HiDream-O1-Image/blob/main/assets/HiDream-O1-Image.pdf).
 - 🚀 **May 8, 2026:** We've open-sourced **HiDream-O1-Image (8B)**, including both the undistilled and distilled Dev variants, together with the Reasoning-Driven Prompt Agent.
-> **HiDream-O1-Image (codename: Peanut) debuts at #8 in the Artificial Analysis Text to Image Arena, which is positioned to be the new leading open weights Text to Image model (2026-5-5).**
 <p align="center">
   <img src="assets/leaderboard.png" alt="Artificial Analysis Text to Image Arena" width="100%"/>
   <br><sub><b>Artificial Analysis Text to Image Arena</b> at up to 2,048 × 2,048.</sub>
@@ -36,7 +39,6 @@ HiDream-O1-Image is a natively unified image generative foundation model built o
   <br><sub><b>Subject-driven personalization</b> — preserve identity / IP across new scenes.</sub>
 </p>
 ## Key Features
 - 🧬 **Pixel-Level Unified Transformer** — One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
@@ -45,14 +47,18 @@ HiDream-O1-Image is a natively unified image generative foundation model built o
 - 🖼️ **Native High Resolution** — Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
 - ⚡ **Exceptional Efficiency and Versatility at 8B Scale** — With only 8B parameters, achieves performance parity with or even surpasses larger open-source DiTs and leading closed-source models.
 ## Models
 | Name | Script | Inference Steps | HuggingFace Repo |
 | :--- | :--- | :---: | :--- |
-| HiDream-O1-Image | `inference.py` | 50 | [🤗 HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) |
-| HiDream-O1-Image-Dev | `inference.py` | 28 | [🤗 HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) |
-| Prompt Agent | `prompt_agent.py` | — | [🤗 google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
-| Web Demo | `app.py` | — | — |
 ## Evaluation
@@ -189,7 +195,7 @@ cd HiDream-O1-Image
 pip install -r requirements.txt
 ```
-> **Note on `flash-attn`.** We highly recommend installing [`flash-attn`](https://github.com/Dao-AILab/flash-attention) for optimized attention computation. **If you do not (or cannot) install `flash-attn`, you must edit `models/pipeline.py` line 291 and change `"use_flash_attn": True` to `"use_flash_attn": False`** — otherwise inference will fail to import the kernel.
 ## Reasoning-Driven Prompt Agent
@@ -299,6 +305,8 @@ python inference.py \
     --model_type dev
 ```
 ### Command Line Arguments
 - `--model_path`: Path to the complete HuggingFace model directory (undistilled or distilled).
@@ -308,16 +316,17 @@ python inference.py \
 - `--height` / `--width`: Output image dimensions (default: `2048` × `2048`; values snap to valid resolutions internally).
 - `--model_type`: `full` or `dev` (default: `full`). Selects the inference recipe:
   - `full`: 50 steps, guidance scale `5.0`, shift `3.0`, default scheduler.
-  - `dev`: 28 steps, guidance scale `0.0`, shift `1.0`, flash scheduler with predefined timesteps.
 - `--seed`: Random seed (default: `32`).
 - `--guidance_scale`: Guidance scale (default: `5.0`). Only effective when `--model_type` is `full`.
-- `--noise_scale_start`, `--noise_scale_end`: Control the scale of the noise injected by the scheduler at each denoising step; the per-step scale linearly interpolates from `noise_scale_start` (first step) to `noise_scale_end` (last step). See `models/pipeline.py:262` and `models/pipeline.py:273`. Defaults: `7.5`, `7.5`.
-- `--noise_clip_std`: Per-step clipping threshold (in units of the injected noise's standard deviation) applied to the noise added during scheduler stepping. See `models/flash_scheduler.py:348-350`. Default: `2.5`.
 - `--keep_original_aspect`: When exactly one reference image is provided, resize it with `max_size=2048` and use its dimensions for the target image (preserves the reference's aspect ratio) if `True`.
 ## Web Demo
-`app.py` is a self-contained Flask web application that exposes all generation modes. It also integrates the Reasoning-Driven Prompt Agent.
 ### Starting the server
@@ -339,11 +348,24 @@ Then open `http://localhost:7860` in your browser.
 | `--host` | `0.0.0.0` | Bind address for the Flask server. |
 | `--port` | `7860` | Port for the Flask server. |
-All four arguments can also be set via environment variables (see `.env.example`): `HIDREAM_MODEL_PATH`, `HIDREAM_MODEL_TYPE`, `HIDREAM_HOST`, and `HIDREAM_PORT`.
 ### Prompt Agent in the UI
-The sidebar contains a Prompt Agent panel that calls the same Reasoning-Driven Prompt Agent used by `prompt_agent.py`.  Select either the *OpenAI-compatible API* backend (any endpoint, key, and model name) or the *Local · Gemma* backend (set `HIDREAM_AGENT_MODEL` in `.env` or the environment to point to your local Gemma-4-31B-it weights).
 ## License
 The code in this repository and the HiDream-O1-Image models are licensed under MIT License.

 pipeline_tag: image-text-to-image
 library_name: transformers
 ---
 # HiDream-O1-Image
 HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.
 ## Project Updates
+- 🚀 **May 14, 2026:** We open-sourced [**HiDream-O1-Image-Dev-2604**](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev-2604) together with its [prompt refiner](https://huggingface.co/HiDream-ai/Prompt-Refine); both are tailored for text-to-image generation.
+- 🛠️ **May 13, 2026:** Inference & pipeline updates: accelerated IP inference; the IP pipeline now supports **layout** and **skeleton** conditioning; updated the Dev editing scheduler. For editing tasks we recommend using the **full** model. PyTorch 2.9.x is not recommended because of [this issue](https://github.com/QwenLM/Qwen3-VL/issues/1811).
 - 🤗 **May 10, 2026:** Try **HiDream-O1-Image** online on Hugging Face Spaces — [🤗 HiDream-O1-Image](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image) and [🤗 HiDream-O1-Image-Dev](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image-Dev).
+- 📕 **May 10, 2026:** Our **technical report** is now available — [📑 HiDream-O1-Image.pdf](assets/HiDream-O1-Image.pdf).
 - 🚀 **May 8, 2026:** We've open-sourced **HiDream-O1-Image (8B)**, including both the undistilled and distilled Dev variants, together with the Reasoning-Driven Prompt Agent.
+<div align="center">
+  <video src="https://github.com/user-attachments/assets/cbbdb816-f050-4685-aa51-4741479a0e5c" width="70%" poster=""> </video>
+</div>
+> **HiDream-O1-Image-Dev-2604 debuts at #8 in the Artificial Analysis Text to Image Arena, positioning it as the new leading open-weights text-to-image model.**
 <p align="center">
   <img src="assets/leaderboard.png" alt="Artificial Analysis Text to Image Arena" width="100%"/>
   <br><sub><b>Artificial Analysis Text to Image Arena</b> at up to 2,048 × 2,048.</sub>
   <br><sub><b>Subject-driven personalization</b> — preserve identity / IP across new scenes.</sub>
 </p>
 ## Key Features
 - 🧬 **Pixel-Level Unified Transformer** — One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
 - 🖼️ **Native High Resolution** — Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
 - ⚡ **Exceptional Efficiency and Versatility at 8B Scale** — With only 8B parameters, achieves performance parity with or even surpasses larger open-source DiTs and leading closed-source models.
 ## Models
 | Name | Script | Inference Steps | HuggingFace Repo |
 | :--- | :--- | :---: | :--- |
+| HiDream-O1-Image | [`inference.py`](./inference.py) | 50 | [🤗 HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) |
+| HiDream-O1-Image-Dev | [`inference.py`](./inference.py) | 28 | [🤗 HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) |
+| Prompt Agent | [`prompt_agent.py`](./prompt_agent.py) | — | [🤗 google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
+| Web Demo | [`app.py`](./app.py) | — | — |
+| HiDream-O1-Image-Dev-2604 | [`inference.py` (dev branch)](https://github.com/HiDream-ai/HiDream-O1-Image/blob/dev/inference.py) | 28 | [🤗 HiDream-O1-Image-Dev-2604](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev-2604) |
+| Prompt Agent 2604 | [`prompt_agent_v2.py` (dev branch)](https://github.com/HiDream-ai/HiDream-O1-Image/blob/dev/prompt_agent_v2.py) | — | [🤗 HiDream-ai/Prompt-Refine](https://huggingface.co/HiDream-ai/Prompt-Refine) |
 ## Evaluation
 pip install -r requirements.txt
 ```
+> **Note on `flash-attn`.** We highly recommend installing [`flash-attn`](https://github.com/Dao-AILab/flash-attention) for optimized attention computation. **If you do not (or cannot) install `flash-attn`, you must edit `models/pipeline.py` line 341 and change `"use_flash_attn": True` to `"use_flash_attn": False`** — otherwise inference will fail to import the kernel.
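Before editing the flag by hand, it can help to check whether `flash-attn` is importable in the current environment. The snippet below is a minimal sketch, not code from this repository: `flash_attn_available` and the `config` dict are hypothetical names used only for illustration, and the check confirms only that the `flash_attn` package is installed, not that its kernels load correctly on your GPU.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the `flash_attn` package can be located by the import system."""
    return importlib.util.find_spec("flash_attn") is not None


# Hypothetical config fragment: mirror the result in the `use_flash_attn` flag.
config = {"use_flash_attn": flash_attn_available()}
print(config)
```

If this prints `{'use_flash_attn': False}`, set the flag in `models/pipeline.py` to `False` as described in the note above.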
 ## Reasoning-Driven Prompt Agent
     --model_type dev
 ```
+For **editing** tasks (exactly one reference image), the Dev model defaults to the recommended `flow_match` scheduler. Pass `--editing_scheduler flash` to use the flash scheduler instead. This flag has no effect on the `full` model or on non-editing tasks.
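The selection rule described above can be summarized as a small decision function. This is an illustrative sketch only: `select_scheduler` is a hypothetical helper, not a function from `inference.py`, and the returned strings (including `"default"` for the `full` model) are labels chosen here for readability.

```python
def select_scheduler(model_type: str, num_reference_images: int,
                     editing_scheduler: str = "flow_match") -> str:
    """Return a label for the scheduler the documented rules would pick."""
    if model_type not in ("full", "dev"):
        raise ValueError(f"unknown model_type: {model_type!r}")
    if editing_scheduler not in ("flow_match", "flash"):
        raise ValueError(f"unknown editing_scheduler: {editing_scheduler!r}")

    # The full model always uses its default scheduler; the flag is ignored.
    if model_type == "full":
        return "default"

    # For the Dev model, "editing" means exactly one reference image.
    is_editing = num_reference_images == 1
    return editing_scheduler if is_editing else "flash"


print(select_scheduler("dev", 1))           # editing with the Dev model
print(select_scheduler("dev", 1, "flash"))  # editing, flash requested
print(select_scheduler("dev", 0))           # text-to-image with the Dev model
```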
 ### Command Line Arguments
 - `--model_path`: Path to the complete HuggingFace model directory (undistilled or distilled).
 - `--height` / `--width`: Output image dimensions (default: `2048` × `2048`; values snap to valid resolutions internally).
 - `--model_type`: `full` or `dev` (default: `full`). Selects the inference recipe:
   - `full`: 50 steps, guidance scale `5.0`, shift `3.0`, default scheduler.
+  - `dev`: 28 steps, guidance scale `0.0`, shift `1.0`, flash scheduler with predefined timesteps. For editing tasks (exactly one reference image), the default scheduler is `flow_match` instead — see `--editing_scheduler`.
 - `--seed`: Random seed (default: `32`).
 - `--guidance_scale`: Guidance scale (default: `5.0`). Only effective when `--model_type` is `full`.
+- `--noise_scale_start`, `--noise_scale_end`: Control the scale of the noise injected by the scheduler at each denoising step; the per-step scale linearly interpolates from `noise_scale_start` (first step) to `noise_scale_end` (last step). See `models/pipeline.py:313` (initial noise) and `models/pipeline.py:323-326` (per-step linear interpolation). Defaults: `7.5`, `7.5`.
+- `--noise_clip_std`: Per-step clipping threshold (in units of the injected noise's standard deviation) applied to the noise added during scheduler stepping. See `models/flash_scheduler.py:350-354`. Default: `2.5`.
+- `--editing_scheduler`: Scheduler to use for editing tasks (exactly one reference image) when `--model_type dev`. Choices: `flow_match` (default) or `flash`. Ignored for the `full` model and for non-editing tasks.
 - `--keep_original_aspect`: When exactly one reference image is provided, resize it with `max_size=2048` and use its dimensions for the target image (preserves the reference's aspect ratio) if `True`.
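To make the noise arguments concrete, the sketch below shows one way a per-step noise scale could be linearly interpolated and how clipping at a multiple of the standard deviation behaves. It is an assumption-laden illustration, not the code in `models/pipeline.py` or `models/flash_scheduler.py`: the function names are hypothetical, the clipping is assumed to be symmetric around zero, and plain Python lists stand in for tensors.

```python
import random


def per_step_noise_scales(num_steps: int, start: float = 7.5, end: float = 7.5) -> list:
    """Linearly interpolate the noise scale from `start` (first step) to `end` (last step)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def clipped_noise(n: int, scale: float, clip_std: float = 2.5, rng=None) -> list:
    """Draw n Gaussian samples with std `scale`, clipped to +/- clip_std * scale."""
    rng = rng or random.Random(0)
    limit = clip_std * scale
    return [max(-limit, min(limit, rng.gauss(0.0, scale))) for _ in range(n)]


scales = per_step_noise_scales(28)            # Dev recipe: 28 steps
noise = clipped_noise(4, scale=scales[0])     # a few samples for the first step
print(scales[0], scales[-1], noise)
```

With the default values `7.5` and `7.5` the scale is constant across steps; setting `--noise_scale_end` lower than `--noise_scale_start` would make the injected noise shrink as denoising proceeds.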
 ## Web Demo
+`app.py` is a single-file Flask web UI (with HTML / CSS / JS embedded inline) that exposes all generation modes. It also integrates the Reasoning-Driven Prompt Agent.
 ### Starting the server
 | `--host` | `0.0.0.0` | Bind address for the Flask server. |
 | `--port` | `7860` | Port for the Flask server. |
+All four CLI arguments above can also be set via environment variables (see `.env.example`): `HIDREAM_MODEL_PATH`, `HIDREAM_MODEL_TYPE`, `HIDREAM_HOST`, and `HIDREAM_PORT`.
+The Prompt Agent panel in the Web Demo reads additional environment variables from `.env`:
+| Env Var | Used by | Description |
+| :--- | :--- | :--- |
+| `HIDREAM_AGENT_MODEL` | Local · Gemma backend | Path or HF repo id of the local Gemma weights. |
+| `OPENAI_BASE_URL` | OpenAI-compatible API backend | Default base URL pre-filled in the UI. |
+| `OPENAI_API_KEY` | OpenAI-compatible API backend | Default API key pre-filled in the UI. |
+| `OPENAI_MODEL` | OpenAI-compatible API backend | Default model name pre-filled in the UI. |
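One common way to let CLI arguments fall back to environment variables is to use the environment values as `argparse` defaults. The sketch below illustrates that pattern with the four server variables named above; it is not taken from `app.py`. The precedence (an explicit CLI flag wins over the environment) and the `full` default for `--model_type` are assumptions, while the host and port defaults follow the table above.

```python
import argparse
import os


def build_parser(env=None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from HIDREAM_* environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Hypothetical server launcher")
    parser.add_argument("--model_path", default=env.get("HIDREAM_MODEL_PATH"))
    parser.add_argument("--model_type", choices=["full", "dev"],
                        default=env.get("HIDREAM_MODEL_TYPE", "full"))  # assumed default
    parser.add_argument("--host", default=env.get("HIDREAM_HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int,
                        default=int(env.get("HIDREAM_PORT", "7860")))
    return parser


# Environment value is used when the flag is absent.
args = build_parser({"HIDREAM_PORT": "9000"}).parse_args([])
print(args.host, args.port)

# An explicit flag overrides the environment value.
args = build_parser({"HIDREAM_PORT": "9000"}).parse_args(["--port", "8000"])
print(args.port)
```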
 ### Prompt Agent in the UI
+The sidebar contains a Prompt Agent panel that calls the same Reasoning-Driven Prompt Agent used by `prompt_agent.py`. Select either the *OpenAI-compatible API* backend (any endpoint, key, and model name) or the *Local · Gemma* backend (set `HIDREAM_AGENT_MODEL` in `.env` or the environment to point to your local Gemma-4-31B-it weights).
+### Editing Scheduler (Dev model only)
+When the server is launched with `--model_type dev`, the **Edit** tab exposes a *Scheduler* dropdown with two options: `flow_match` (default) and `flash`. The selector is hidden for the `full` model and for the Text → Image / Subject tabs, where the scheduler is fixed.
 ## License
 The code in this repository and the HiDream-O1-Image models are licensed under MIT License.