cai-qi committed on
Commit aa5fa3a · verified · 1 Parent(s): 3a0517a

Update README.md

Files changed (1)
  1. README.md +38 -16
README.md CHANGED
@@ -3,19 +3,22 @@ license: mit
3
  pipeline_tag: image-text-to-image
4
  library_name: transformers
5
  ---
6
-
7
  # HiDream-O1-Image
8
 
9
  HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It natively encodes raw pixels, text, and task-specific conditions in a single shared token space, supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.
10
 
11
  ## Project Updates
12
- - 🛠️ **May 13, 2026:** Inference & pipeline updates: accelerated IP inference; the IP pipeline now supports **layout** and **skeleton** conditioning; updated the Dev editing scheduler. For editing tasks we recommend using the **full** model. PyTorch 2.9.x is not recommended due to the [issue](https://github.com/QwenLM/Qwen3-VL/issues/1811)
 
13
  - 🤗 **May 10, 2026:** Try **HiDream-O1-Image** online on Hugging Face Spaces: [🤗 HiDream-O1-Image](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image) and [🤗 HiDream-O1-Image-Dev](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image-Dev).
14
- - 📕 **May 10, 2026:** Our **technical report** is now available: [📑 HiDream-O1-Image.pdf](https://github.com/HiDream-ai/HiDream-O1-Image/blob/main/assets/HiDream-O1-Image.pdf).
15
  - 🚀 **May 8, 2026:** We've open-sourced **HiDream-O1-Image (8B)**, including both the undistilled and distilled Dev variants, together with the Reasoning-Driven Prompt Agent.
16
 
17
- > **HiDream-O1-Image (codename: Peanut) debuts at #8 in the Artificial Analysis Text to Image Arena, positioning it as the new leading open-weights text-to-image model (2026-5-5).**
 
 
18
 
 
19
  <p align="center">
20
  <img src="assets/leaderboard.png" alt="Artificial Analysis Text to Image Arena" width="100%"/>
21
  <br><sub><b>Artificial Analysis Text to Image Arena</b> at up to 2,048 × 2,048.</sub>
@@ -36,7 +39,6 @@ HiDream-O1-Image is a natively unified image generative foundation model built o
36
  <br><sub><b>Subject-driven personalization</b>: preserve identity / IP across new scenes.</sub>
37
  </p>
38
 
39
-
40
  ## Key Features
41
 
42
  - 🧬 **Pixel-Level Unified Transformer**: One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
@@ -45,14 +47,18 @@ HiDream-O1-Image is a natively unified image generative foundation model built o
45
  - 🖼️ **Native High Resolution**: Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
46
  - ⚡ **Exceptional Efficiency and Versatility at 8B Scale**: With only 8B parameters, it matches or even surpasses larger open-source DiTs and leading closed-source models.
47
 
 
 
48
  ## Models
49
 
50
  | Name | Script | Inference Steps | HuggingFace Repo |
51
  | :--- | :--- | :---: | :--- |
52
- | HiDream-O1-Image | `inference.py` | 50 | [🤗 HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) |
53
- | HiDream-O1-Image-Dev | `inference.py` | 28 | [🤗 HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) |
54
- | Prompt Agent | `prompt_agent.py` | - | [🤗 google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
55
- | Web Demo | `app.py` | - | - |
 
 
56
 
57
  ## Evaluation
58
 
@@ -189,7 +195,7 @@ cd HiDream-O1-Image
189
  pip install -r requirements.txt
190
  ```
191
 
192
- > **Note on `flash-attn`.** We highly recommend installing [`flash-attn`](https://github.com/Dao-AILab/flash-attention) for optimized attention computation. **If you do not (or cannot) install `flash-attn`, you must edit `models/pipeline.py` line 291 and change `"use_flash_attn": True` to `"use_flash_attn": False`**; otherwise inference will fail to import the kernel.
193
 
194
  ## Reasoning-Driven Prompt Agent
195
 
@@ -299,6 +305,8 @@ python inference.py \
299
  --model_type dev
300
  ```
301
 
 
 
302
  ### Command Line Arguments
303
 
304
  - `--model_path`: Path to the complete HuggingFace model directory (undistilled or distilled).
@@ -308,16 +316,17 @@ python inference.py \
308
  - `--height` / `--width`: Output image dimensions (default: `2048` × `2048`; values snap to valid resolutions internally).
309
  - `--model_type`: `full` or `dev` (default: `full`). Selects the inference recipe:
310
  - `full`: 50 steps, guidance scale `5.0`, shift `3.0`, default scheduler.
311
- - `dev`: 28 steps, guidance scale `0.0`, shift `1.0`, flash scheduler with predefined timesteps.
312
  - `--seed`: Random seed (default: `32`).
313
  - `--guidance_scale`: Guidance scale (default: `5.0`). Only effective when `--model_type` is `full`.
314
- - `--noise_scale_start`, `--noise_scale_end`: Control the scale of the noise injected by the scheduler at each denoising step; the per-step scale linearly interpolates from `noise_scale_start` (first step) to `noise_scale_end` (last step). See `models/pipeline.py:262` and `models/pipeline.py:273`. Defaults: `7.5`, `7.5`.
315
- - `--noise_clip_std`: Per-step clipping threshold (in units of the injected noise's standard deviation) applied to the noise added during scheduler stepping. See `models/flash_scheduler.py:348-350`. Default: `2.5`.
 
316
  - `--keep_original_aspect`: If `True` and exactly one reference image is provided, resize it with `max_size=2048` and use its dimensions for the target image, preserving the reference's aspect ratio.
317
 
318
  ## Web Demo
319
 
320
- `app.py` is a self-contained Flask web application that exposes all generation modes. It also integrates the Reasoning-Driven Prompt Agent.
321
 
322
  ### Starting the server
323
 
@@ -339,11 +348,24 @@ Then open `http://localhost:7860` in your browser.
339
  | `--host` | `0.0.0.0` | Bind address for the Flask server. |
340
  | `--port` | `7860` | Port for the Flask server. |
341
 
342
- All four arguments can also be set via environment variables (see `.env.example`): `HIDREAM_MODEL_PATH`, `HIDREAM_MODEL_TYPE`, `HIDREAM_HOST`, and `HIDREAM_PORT`.
 
 
 
 
 
 
 
 
 
343
 
344
  ### Prompt Agent in the UI
345
 
346
- The sidebar contains a Prompt Agent panel that calls the same Reasoning-Driven Prompt Agent used by `prompt_agent.py`. Select either the *OpenAI-compatible API* backend (any endpoint, key, and model name) or the *Local · Gemma* backend (set `HIDREAM_AGENT_MODEL` in `.env` or the environment to point to your local Gemma-4-31B-it weights).
 
 
 
 
347
 
348
  ## License
349
  The code in this repository and the HiDream-O1-Image models are licensed under the MIT License.
 
3
  pipeline_tag: image-text-to-image
4
  library_name: transformers
5
  ---
 
6
  # HiDream-O1-Image
7
 
8
  HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It natively encodes raw pixels, text, and task-specific conditions in a single shared token space, supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.
9
 
10
  ## Project Updates
11
+ - 🚀 **May 14, 2026:** We open-sourced [**HiDream-O1-Image-Dev-2604**](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev-2604) with its [prompt refiner](https://huggingface.co/HiDream-ai/Prompt-Refine), tailored for text-to-image generation.
12
+ - 🛠️ **May 13, 2026:** Inference & pipeline updates: accelerated IP inference; the IP pipeline now supports **layout** and **skeleton** conditioning; updated the Dev editing scheduler. For editing tasks we recommend using the **full** model. PyTorch 2.9.x is not recommended due to [this issue](https://github.com/QwenLM/Qwen3-VL/issues/1811).
13
  - 🤗 **May 10, 2026:** Try **HiDream-O1-Image** online on Hugging Face Spaces: [🤗 HiDream-O1-Image](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image) and [🤗 HiDream-O1-Image-Dev](https://huggingface.co/spaces/HiDream-ai/HiDream-O1-Image-Dev).
14
+ - 📕 **May 10, 2026:** Our **technical report** is now available: [📑 HiDream-O1-Image.pdf](assets/HiDream-O1-Image.pdf).
15
  - 🚀 **May 8, 2026:** We've open-sourced **HiDream-O1-Image (8B)**, including both the undistilled and distilled Dev variants, together with the Reasoning-Driven Prompt Agent.
16
 
17
+ <div align="center">
18
+ <video src="https://github.com/user-attachments/assets/cbbdb816-f050-4685-aa51-4741479a0e5c" width="70%" poster=""> </video>
19
+ </div>
20
 
21
+ > **HiDream-O1-Image-Dev-2604 debuts at #8 in the Artificial Analysis Text to Image Arena, positioning it as the new leading open-weights text-to-image model.**
22
  <p align="center">
23
  <img src="assets/leaderboard.png" alt="Artificial Analysis Text to Image Arena" width="100%"/>
24
  <br><sub><b>Artificial Analysis Text to Image Arena</b> at up to 2,048 × 2,048.</sub>
 
39
  <br><sub><b>Subject-driven personalization</b>: preserve identity / IP across new scenes.</sub>
40
  </p>
41
 
 
42
  ## Key Features
43
 
44
  - 🧬 **Pixel-Level Unified Transformer**: One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
 
47
  - 🖼️ **Native High Resolution**: Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
48
  - ⚡ **Exceptional Efficiency and Versatility at 8B Scale**: With only 8B parameters, it matches or even surpasses larger open-source DiTs and leading closed-source models.
49
 
50
+
51
+
52
  ## Models
53
 
54
  | Name | Script | Inference Steps | HuggingFace Repo |
55
  | :--- | :--- | :---: | :--- |
56
+ | HiDream-O1-Image | [`inference.py`](./inference.py) | 50 | [🤗 HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) |
57
+ | HiDream-O1-Image-Dev | [`inference.py`](./inference.py) | 28 | [🤗 HiDream-O1-Image-Dev](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev) |
58
+ | Prompt Agent | [`prompt_agent.py`](./prompt_agent.py) | - | [🤗 google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
59
+ | Web Demo | [`app.py`](./app.py) | - | - |
60
+ | HiDream-O1-Image-Dev-2604 | [`inference.py` (dev branch)](https://github.com/HiDream-ai/HiDream-O1-Image/blob/dev/inference.py) | 28 | [🤗 HiDream-O1-Image-Dev-2604](https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev-2604) |
61
+ | Prompt Agent 2604 | [`prompt_agent_v2.py` (dev branch)](https://github.com/HiDream-ai/HiDream-O1-Image/blob/dev/prompt_agent_v2.py) | - | [🤗 HiDream-ai/Prompt-Refine](https://huggingface.co/HiDream-ai/Prompt-Refine) |
62
 
63
  ## Evaluation
64
 
 
195
  pip install -r requirements.txt
196
  ```
197
 
198
+ > **Note on `flash-attn`.** We highly recommend installing [`flash-attn`](https://github.com/Dao-AILab/flash-attention) for optimized attention computation. **If you do not (or cannot) install `flash-attn`, you must edit `models/pipeline.py` line 341 and change `"use_flash_attn": True` to `"use_flash_attn": False`**; otherwise inference will fail to import the kernel.
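
Before editing `models/pipeline.py`, it can help to confirm whether `flash-attn` is importable in the current environment. The snippet below is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical and not part of this repository, and the only assumption is that the package's import name is `flash_attn`.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # The pip package is `flash-attn`; its import name is `flash_attn`.
    if module_available("flash_attn"):
        print('flash_attn found: "use_flash_attn": True should work')
    else:
        print('flash_attn missing: set "use_flash_attn": False in models/pipeline.py')
```

If the check reports the module as missing, apply the manual edit described in the note above.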
199
 
200
  ## Reasoning-Driven Prompt Agent
201
 
 
305
  --model_type dev
306
  ```
307
 
308
+ For **editing** tasks (exactly one reference image), the Dev model defaults to the `flow_match` scheduler, which is the recommended choice for editing. Pass `--editing_scheduler flash` to use the flash scheduler instead. This flag has no effect on the `full` model or on non-editing tasks.
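
The selection rule can be summarized in a few lines of Python. This is only an illustrative sketch, not the logic in `inference.py`: the `Recipe` class and `resolve_recipe` function are hypothetical names, the numeric values are the ones this README lists for `--model_type`, and the sketch assumes that only the scheduler name changes for Dev editing (the README does not say whether other settings change).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Recipe:
    steps: int
    guidance_scale: float
    shift: float
    scheduler: str


# Values as documented for --model_type in this README.
RECIPES = {
    "full": Recipe(steps=50, guidance_scale=5.0, shift=3.0, scheduler="default"),
    "dev": Recipe(steps=28, guidance_scale=0.0, shift=1.0, scheduler="flash"),
}


def resolve_recipe(model_type, num_reference_images=0, editing_scheduler="flow_match"):
    """Pick a recipe; swap the scheduler only for Dev editing tasks."""
    if model_type not in RECIPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    if editing_scheduler not in ("flow_match", "flash"):
        raise ValueError(f"unknown editing_scheduler: {editing_scheduler!r}")
    base = RECIPES[model_type]
    is_editing = num_reference_images == 1  # "exactly one reference image"
    if model_type == "dev" and is_editing:
        return replace(base, scheduler=editing_scheduler)
    return base  # full model and non-editing tasks ignore the flag


if __name__ == "__main__":
    print(resolve_recipe("dev", num_reference_images=1))
    print(resolve_recipe("full", num_reference_images=1, editing_scheduler="flash"))
```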
309
+
310
  ### Command Line Arguments
311
 
312
  - `--model_path`: Path to the complete HuggingFace model directory (undistilled or distilled).
 
316
  - `--height` / `--width`: Output image dimensions (default: `2048` × `2048`; values snap to valid resolutions internally).
317
  - `--model_type`: `full` or `dev` (default: `full`). Selects the inference recipe:
318
  - `full`: 50 steps, guidance scale `5.0`, shift `3.0`, default scheduler.
319
+ - `dev`: 28 steps, guidance scale `0.0`, shift `1.0`, flash scheduler with predefined timesteps. For editing tasks (exactly one reference image), the default scheduler is `flow_match` instead; see `--editing_scheduler`.
320
  - `--seed`: Random seed (default: `32`).
321
  - `--guidance_scale`: Guidance scale (default: `5.0`). Only effective when `--model_type` is `full`.
322
+ - `--noise_scale_start`, `--noise_scale_end`: Control the scale of the noise injected by the scheduler at each denoising step; the per-step scale linearly interpolates from `noise_scale_start` (first step) to `noise_scale_end` (last step). See `models/pipeline.py:313` (initial noise) and `models/pipeline.py:323-326` (per-step linear interpolation). Defaults: `7.5`, `7.5`.
323
+ - `--noise_clip_std`: Per-step clipping threshold (in units of the injected noise's standard deviation) applied to the noise added during scheduler stepping. See `models/flash_scheduler.py:350-354`. Default: `2.5`.
324
+ - `--editing_scheduler`: Scheduler to use for editing tasks (exactly one reference image) when `--model_type dev`. Choices: `flow_match` (default) or `flash`. Ignored for the `full` model and for non-editing tasks.
325
  - `--keep_original_aspect`: If `True` and exactly one reference image is provided, resize it with `max_size=2048` and use its dimensions for the target image, preserving the reference's aspect ratio.
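
To make the `--noise_scale_start`, `--noise_scale_end`, and `--noise_clip_std` descriptions more concrete, here is a small standalone sketch. It does not reproduce the code in `models/pipeline.py` or `models/flash_scheduler.py`; the function names are hypothetical, and clipping unit-variance Gaussian samples to `±noise_clip_std` before scaling is an assumption about how the two options combine.

```python
import random


def noise_scale_schedule(num_steps, start=7.5, end=7.5):
    """Linearly interpolate the noise scale from `start` (first step) to `end` (last step)."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def clipped_unit_noise(n, clip_std=2.5, seed=32):
    """Draw n standard-normal samples and clip each to [-clip_std, clip_std]."""
    rng = random.Random(seed)
    return [max(-clip_std, min(clip_std, rng.gauss(0.0, 1.0))) for _ in range(n)]


if __name__ == "__main__":
    scales = noise_scale_schedule(28)   # defaults: constant 7.5 at every step
    noise = clipped_unit_noise(4)       # clipped before any scaling (assumption)
    print(scales[0], scales[-1])
    print([scales[0] * x for x in noise])
```

With the default values both endpoints are `7.5`, so the schedule is constant; the interpolation only matters when the two flags differ.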
326
 
327
  ## Web Demo
328
 
329
+ `app.py` is a single-file Flask web UI (with HTML / CSS / JS embedded inline) that exposes all generation modes. It also integrates the Reasoning-Driven Prompt Agent.
330
 
331
  ### Starting the server
332
 
 
348
  | `--host` | `0.0.0.0` | Bind address for the Flask server. |
349
  | `--port` | `7860` | Port for the Flask server. |
350
 
351
+ All four CLI arguments above can also be set via environment variables (see `.env.example`): `HIDREAM_MODEL_PATH`, `HIDREAM_MODEL_TYPE`, `HIDREAM_HOST`, and `HIDREAM_PORT`.
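
One common way to wire environment variables to CLI arguments is to use them as `argparse` defaults, as sketched below. This is not the code in `app.py`: the `build_parser` helper is hypothetical, the `full` default for `--model_type` is borrowed from `inference.py`, and letting explicit CLI flags override the environment is an assumption.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from HIDREAM_* environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Sketch of env-backed CLI defaults")
    parser.add_argument("--model_path", default=env.get("HIDREAM_MODEL_PATH"))
    parser.add_argument("--model_type", choices=["full", "dev"],
                        default=env.get("HIDREAM_MODEL_TYPE", "full"))
    parser.add_argument("--host", default=env.get("HIDREAM_HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int,
                        default=int(env.get("HIDREAM_PORT", "7860")))
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Under this scheme, `HIDREAM_PORT=8080 python app.py` would bind to port 8080, while adding `--port 9000` on the command line would take precedence.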
352
+
353
+ The Prompt Agent panel in the Web Demo reads additional environment variables from `.env`:
354
+
355
+ | Env Var | Used by | Description |
356
+ | :--- | :--- | :--- |
357
+ | `HIDREAM_AGENT_MODEL` | Local · Gemma backend | Path or HF repo id of the local Gemma weights. |
358
+ | `OPENAI_BASE_URL` | OpenAI-compatible API backend | Default base URL pre-filled in the UI. |
359
+ | `OPENAI_API_KEY` | OpenAI-compatible API backend | Default API key pre-filled in the UI. |
360
+ | `OPENAI_MODEL` | OpenAI-compatible API backend | Default model name pre-filled in the UI. |
361
 
362
  ### Prompt Agent in the UI
363
 
364
+ The sidebar contains a Prompt Agent panel that calls the same Reasoning-Driven Prompt Agent used by `prompt_agent.py`. Select either the *OpenAI-compatible API* backend (any endpoint, key, and model name) or the *Local · Gemma* backend (set `HIDREAM_AGENT_MODEL` in `.env` or the environment to point to your local Gemma-4-31B-it weights).
365
+
366
+ ### Editing Scheduler (Dev model only)
367
+
368
+ When the server is launched with `--model_type dev`, the **Edit** tab exposes a *Scheduler* dropdown with two options: `flow_match` (default) and `flash`. The selector is hidden for the `full` model and for the Text → Image / Subject tabs, where the scheduler is fixed.
369
 
370
  ## License
371
  The code in this repository and the HiDream-O1-Image models are licensed under the MIT License.