Ephemeral182 committed on
Commit
432325a
·
verified ·
1 Parent(s): e61d1d5

README: simplify to demo-focused card; remove venue mention, training recipe, acknowledgements, internal SDL schema

Files changed (1)
  1. README.md +34 -99
README.md CHANGED
@@ -44,15 +44,10 @@ datasets:
44
  <img alt="vllm" src="https://img.shields.io/badge/vLLM-0.11-30A14E">
45
  <img alt="cuda" src="https://img.shields.io/badge/CUDA-12.x-76B900?logo=nvidia&logoColor=white">
46
  <img alt="license" src="https://img.shields.io/badge/license-Apache%202.0-green">
47
- <img alt="status" src="https://img.shields.io/badge/status-active-brightgreen">
48
  </p>
49
 
50
  </div>
51
 
52
- > **GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation**
53
- > Sixiang Chen, Zhaohu Xing, Tian Ye, Xinyu Geng, Yunlong Lin, Jianyu Lai, Xuanhua He, Fuxiang Zhai, Jialin Gao, Lei Zhu
54
- > *Submitted to NeurIPS 2026*
55
-
56
  This repository hosts the **GenEvolve agent policy**: a Qwen3-VL-8B-Instruct backbone fine-tuned and self-evolved into a tool-orchestrated image-generation agent. Given a user request, the agent issues web/image searches, retrieves visual references, activates internal generation knowledge, and emits an executable **prompt-reference program** `z = (gen_prompt, reference_images)` that drives any reference-conditioned downstream generator (Qwen-Image-Edit, Nano Banana Pro, ...).
57
 
58
  <div align="center">
@@ -64,18 +59,15 @@ This repository hosts the **GenEvolve agent policy**: a Qwen3-VL-8B-Instruct
64
 
65
  ---
66
 
67
- ## ✨ TL;DR
68
 
69
  - **Tool-orchestrated trajectories.** The agent calls `search`, `image_search`, and `query_knowledge` (8 callable generation skills) before producing a final program `z = (gen_prompt, reference_images)`.
70
- - **Self-evolution = GRPO + Visual Experience Distillation.** Best-vs-worst trajectory pairs are summarized into a *decision guide* (retrieval-key + 6 imperative bullet lists). The teacher view sees the retrieved guide; the student does not. SDL distills the teacher's token-level preferences back into the deployed student. **No runtime memory at inference.**
71
- - **Generator-transferable.** The same trained policy improves both an open-source generator (Qwen-Image-Edit-2511, KScore 0.299 → 0.366) and a strong proprietary generator (Nano Banana Pro, 0.530 → **0.574**).
72
- - **Strong external generalization.** Achieves **0.82** WiScore on the WISE knowledge-intensive benchmark, beating GPT-4o (0.80) and all agentic baselines.
73
-
74
- ---
75
 
76
  ## 📊 Headline Results
77
 
78
- ### GenEvolve-Bench (KScore on the held-out split)
79
 
80
  | Method | Generator | KScore | Knowledge-Anch. | Quality-Anch. |
81
  |---|---|---:|---:|---:|
@@ -95,23 +87,13 @@ This repository hosts the **GenEvolve agent policy**: a Qwen3-VL-8B-Instruct
95
  | Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | **0.85** | 0.68 | 0.78 |
96
  | **GenEvolve + Qwen-Image-Edit** | **0.84** | 0.74 | 0.87 | **0.83** | 0.81 | **0.83** | **0.82** |
97
 
98
- <div align="center">
99
- <img src="assets/visual_comparison.png" alt="Visual comparison vs strong baselines" width="100%">
100
-
101
- <p><em>Visual comparison on representative GenEvolve-Bench cases; <span style="color:#ea580c">orange</span> marks external/uncommon knowledge; <span style="color:#1f6feb">blue</span> marks internal generation-knowledge requirements.</em></p>
102
- </div>
103
-
104
  ---
105
 
106
  ## 🧠 Method Overview
107
 
108
  <p align="center"><img src="assets/overview.png" alt="GenEvolve method overview" width="92%"></p>
109
 
110
- For a user request $x$, the agent samples a multi-turn trajectory
111
-
112
- $$\tau = (a_1, o_1, \ldots, a_T, o_T, z), \qquad z = (g, R),$$
113
-
114
- where each $a_t$ is one of three actions and $o_t$ is the corresponding observation. The downstream generator renders $\hat{y} = G(g, R)$.
115
 
116
  | Tool | Role | Output |
117
  |---|---|---|
@@ -119,31 +101,31 @@ where each $a_t$ is one of three actions and $o_t$ is the corresponding observat
119
  | `image_search(query)` | Visual references; each result is given a unique `IMG_###` id | Image list with local paths |
120
  | `query_knowledge(skill_name)` | Internal generation knowledge: `spatial_layout`, `text_rendering`, `quantity_counting`, `attribute_binding`, `anatomy_body_coherence`, `physical_material_consistency`, `creative_drawing`, `aesthetic_drawing` | Skill markdown |
121
 
122
- **Self-evolution (training-only).** For each prompt the agent samples 6 rollouts. The best/worst pair (with a sufficient reward gap) is summarized by a Gemini-3.1-Pro judge into a single bundle:
123
 
124
- ```
125
- retrieval_key: { trigger, source_prompt_summary }
126
- decision_guidance:
127
- decision_focus
128
- recommended_tool_plan (1–4 imperative bullets)
129
- search_query_guidance (1–3 bullets)
130
- skill_routing_guidance (1–4 bullets)
131
- reference_selection_guidance (1–3 bullets)
132
- prompt_program_guidance (1–3 bullets)
133
- failure_guards (1–3 bullets)
134
- ```
135
 
136
- Bundles are stored in a 500-entry rolling buffer keyed by `embed(trigger + source_prompt_summary)` (Qwen3-Embedding-0.6B) with a cosine retrieval gate of `0.84`. **Only the privileged teacher branch sees the retrieved guide**; the student is regularised toward that teacher with an importance-weighted reverse-KL on the same on-policy tokens (see paper Sec. 5 for the exact loss).
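
The rolling buffer and retrieval gate described in this removed paragraph can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `GuideBuffer` class, its method names, top-1 retrieval, and the `>=` comparison at the gate are all assumptions, and the toy 2-d vectors stand in for Qwen3-Embedding-0.6B keys.

```python
import math
from collections import deque

BUFFER_CAP = 500      # buffer size stated in the text
COSINE_GATE = 0.84    # retrieval gate stated in the text


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class GuideBuffer:
    """Hypothetical rolling store of (key_embedding, guide) pairs."""

    def __init__(self, cap=BUFFER_CAP, gate=COSINE_GATE):
        self.entries = deque(maxlen=cap)  # oldest entry is evicted once full
        self.gate = gate

    def add(self, key_embedding, guide):
        self.entries.append((key_embedding, guide))

    def retrieve(self, query_embedding):
        """Return the most similar guide, or None if nothing passes the gate."""
        best_score, best_guide = -1.0, None
        for key, guide in self.entries:
            score = cosine(query_embedding, key)
            if score > best_score:
                best_score, best_guide = score, guide
        return best_guide if best_score >= self.gate else None


buffer = GuideBuffer()
buffer.add([1.0, 0.0], "guide A")
buffer.add([0.0, 1.0], "guide B")
hit = buffer.retrieve([0.9, 0.1])   # close to the first key, passes the gate
miss = buffer.retrieve([0.7, 0.7])  # equally far from both keys, below the gate
```

Using `deque(maxlen=...)` gives the rolling behaviour for free: once the cap is reached, each new bundle silently evicts the oldest one.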
137
 
138
- ---
139
 
140
- ## 🚀 Quick Start
141
 
142
- The deployed checkpoint is the **student policy**: it consumes a user prompt and returns a JSON `gen_prompt + reference_images` program through a normal `<think>/<tool_call>/<answer>` loop.
143
 
144
- ### Option 1: full GenEvolve runtime (recommended)
145
 
146
- The end-to-end runtime (vLLM/SGLang server + agent loop + tools + Qwen/Nano renderers) lives in the [GitHub repo](https://github.com/Ephemeral182/GenEvolve).
147
 
148
  ```bash
149
  git clone https://github.com/Ephemeral182/GenEvolve.git
@@ -152,10 +134,10 @@ conda create -n genevolve python=3.11 -y && conda activate genevolve
152
  pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
153
  pip install --no-build-isolation -r requirements.txt && pip install -e .
154
 
155
- # serve the policy (TP/DP knobs scale across GPUs)
156
  MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh
157
 
158
- # end-to-end example
159
  export SERPER_API_KEY=<your_key> # required for search / image_search
160
  export GOOGLE_API_KEY=<your_key> # only for the Nano Banana Pro backend
161
  python examples/quickstart.py \
@@ -166,42 +148,18 @@ python examples/quickstart.py \
166
  --output paris.png
167
  ```
168
 
169
- ### Option 2: direct Transformers loading
170
-
171
- ```python
172
- from transformers import AutoModelForCausalLM, AutoProcessor
173
- import torch
174
-
175
- repo = "MeiGen-AI/GenEvolve"
176
- model = AutoModelForCausalLM.from_pretrained(
177
- repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
178
- )
179
- processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
180
-
181
- messages = [
182
- {"role": "system", "content": SYSTEM_PROMPT}, # see GitHub repo
183
- {"role": "user", "content": "A vintage diner sign that says 'BLUE SKY DINER' in red neon."},
184
- ]
185
- prompt_ids = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
186
- out = model.generate(prompt_ids, max_new_tokens=4096, temperature=0.7, top_p=0.95)
187
- print(processor.decode(out[0], skip_special_tokens=True))
188
- ```
189
-
190
- ### Final-answer JSON
191
 
192
  ```json
193
  {
194
  "gen_prompt": "...natural-language prompt that refers to images by 'the first reference image', ...",
195
  "reference_images": [
196
- {"img_id": "IMG_001", "note": "what to copy from this image"},
197
- {"img_id": "IMG_004", "note": "what to copy from this image"}
198
  ]
199
  }
200
  ```
201
 
202
- `gen_prompt` MUST refer to selected images using ordinal phrases (`"the first reference image"`), never raw `IMG_###` ids or URLs. `reference_images` is sorted by `img_id` ascending so that ordinals resolve unambiguously.
203
-
204
- Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favourite reference-conditioned generator (Qwen-Image-Edit, Nano Banana Pro, ...) to obtain the final image.
205
 
206
  ---
207
 
@@ -217,45 +175,22 @@ Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favouri
217
 
218
  ---
219
 
220
- ## 🧾 Training Recipe
221
-
222
- | Stage | Recipe |
223
- |---|---|
224
- | **SFT cold start** | LLaMA-Factory, 2 epochs, 16 GPUs, micro-bsz=2, lr=1e-5 (cosine, warmup 0.02), bf16 + FlashAttention-2, ZeRO-3, vision tower frozen. |
225
- | **Self-evolution** | rLLM/verl, GRPO + experience-conditioned SDL on 8 prompts × 6 rollouts/step, lr=1e-6, ε_ℓ=0.20, ε_h=0.28, 5 epochs over the RL split. |
226
- | **Reward** | KScore image judge (Faithfulness 0.1 / Visual 0.4 / Text 0.4 / Aesthetics 0.1, Gemini 3.1 Pro Preview) + program-sufficiency text judge, weighted 0.5 / 0.5. |
227
- | **SDL** | λ_SDL = 2.0, decision-only mask (`<tool_call>`/`<answer>`), top-10% logp-delta filter (`SDL_TOP_K_FRAC=0.1`), IS-cap ρ_max = 2, per-token clip disabled, `seq-mean-token-sum` aggregation. |
228
- | **Visual experience memory** | 1 bundle / comparison (decision guide); cosine retrieval gate ≥ 0.84; buffer cap 500; Qwen3-Embedding-0.6B keys; teacher-only (no inference-time memory). |
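
As a worked example of the weights in the Reward row of this removed table, the sketch below combines the four KScore sub-scores and the program-sufficiency score. It assumes a plain linear weighted sum and scores in [0, 1]; the table lists only the weights, so the combination rule, the dictionary keys, and the function names are illustrative.

```python
# Weights as listed in the Reward row; the linear combination is an assumption.
KSCORE_WEIGHTS = {"faithfulness": 0.1, "visual": 0.4, "text": 0.4, "aesthetics": 0.1}
IMAGE_WEIGHT, PROGRAM_WEIGHT = 0.5, 0.5


def kscore(sub_scores):
    """Weighted sum of the four image-judge sub-scores."""
    return sum(KSCORE_WEIGHTS[name] * sub_scores[name] for name in KSCORE_WEIGHTS)


def total_reward(sub_scores, program_sufficiency):
    """Blend the image judge and the program-sufficiency text judge."""
    return IMAGE_WEIGHT * kscore(sub_scores) + PROGRAM_WEIGHT * program_sufficiency


example = {"faithfulness": 1.0, "visual": 0.5, "text": 0.5, "aesthetics": 1.0}
image_score = kscore(example)        # 0.1 + 0.2 + 0.2 + 0.1, roughly 0.6
reward = total_reward(example, 0.8)  # 0.5 * 0.6 + 0.5 * 0.8, roughly 0.7
```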
229
-
230
- Full hyper-parameters and ablations are in the appendix tables of the paper.
231
-
232
- ---
233
-
234
  ## ⚖️ Intended Use, Limits, Bias
235
 
236
  - **Intended use.** Research on tool-using image-generation agents, agentic prompt-program synthesis, and self-distillation from generated outcomes.
237
  - **Out of scope.** The model produces a *prompt + reference list*, not pixels. Final image quality and safety are inherited from the downstream generator you pair it with. Do not use the agent to fabricate likenesses, infringing logos, or misleading factual imagery; apply your own content-safety filter on the generator side.
238
  - **Search dependency.** The agent issues live web/image queries through user-provided tool wrappers. Quality of grounded facts depends on the search backend you plug in.
239
- - **Bias.** Tool outputs and reference images come from public web search, which carries demographic, cultural, and geographic biases. The reward judges (Gemini 3.1 Pro Preview) are themselves models with their own biases, which may shape the post-RL policy.
240
 
241
  ---
242
 
243
  ## 📑 Citation
244
 
245
  ```bibtex
246
- @inproceedings{chen2026genevolve,
247
- title = {GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation},
248
- author = {Chen, Sixiang and Xing, Zhaohu and Ye, Tian and Geng, Xinyu and Lin, Yunlong
249
- and Lai, Jianyu and He, Xuanhua and Zhai, Fuxiang and Gao, Jialin and Zhu, Lei},
250
- booktitle = {Submitted to Conference on Neural Information Processing Systems (NeurIPS)},
251
- year = {2026}
252
  }
253
  ```
254
-
255
- ---
256
-
257
- ## 🤝 Acknowledgements
258
-
259
- We thank the Qwen, Gemini, FLUX, Z-Image, and BAGEL teams for the underlying generators we evaluate against, and the Skill-SD / Gen-Searcher / KnowGen / WISE authors for the open recipes and benchmarks our work builds on.
260
-
261
- For questions or collaboration, please reach out to [Sixiang Chen](mailto:ephemeral182@gmail.com) or open an issue on the [GitHub repo](https://github.com/Ephemeral182/GenEvolve/issues).
 
44
  <img alt="vllm" src="https://img.shields.io/badge/vLLM-0.11-30A14E">
45
  <img alt="cuda" src="https://img.shields.io/badge/CUDA-12.x-76B900?logo=nvidia&logoColor=white">
46
  <img alt="license" src="https://img.shields.io/badge/license-Apache%202.0-green">
 
47
  </p>
48
 
49
  </div>
50
 
51
  This repository hosts the **GenEvolve agent policy**: a Qwen3-VL-8B-Instruct backbone fine-tuned and self-evolved into a tool-orchestrated image-generation agent. Given a user request, the agent issues web/image searches, retrieves visual references, activates internal generation knowledge, and emits an executable **prompt-reference program** `z = (gen_prompt, reference_images)` that drives any reference-conditioned downstream generator (Qwen-Image-Edit, Nano Banana Pro, ...).
52
 
53
  <div align="center">
 
59
 
60
  ---
61
 
62
+ ## ✨ Highlights
63
 
64
  - **Tool-orchestrated trajectories.** The agent calls `search`, `image_search`, and `query_knowledge` (8 callable generation skills) before producing a final program `z = (gen_prompt, reference_images)`.
65
+ - **Self-evolution with Visual Experience Distillation.** Best-vs-worst trajectory pairs are distilled token-level into the deployed student. **No runtime memory at inference.**
66
+ - **Generator-transferable.** The same trained policy works with both an open-source generator (Qwen-Image-Edit-2511) and a strong proprietary generator (Nano Banana Pro).
67
 
68
  ## 📊 Headline Results
69
 
70
+ ### GenEvolve-Bench (KScore, held-out split)
71
 
72
  | Method | Generator | KScore | Knowledge-Anch. | Quality-Anch. |
73
  |---|---|---:|---:|---:|
 
87
  | Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | **0.85** | 0.68 | 0.78 |
88
  | **GenEvolve + Qwen-Image-Edit** | **0.84** | 0.74 | 0.87 | **0.83** | 0.81 | **0.83** | **0.82** |
89
 
90
  ---
91
 
92
  ## 🧠 Method Overview
93
 
94
  <p align="center"><img src="assets/overview.png" alt="GenEvolve method overview" width="92%"></p>
95
 
96
+ For a user request, the agent samples a multi-turn trajectory of tool calls before emitting the final prompt-reference program. The downstream generator then renders the image.
97
 
98
  | Tool | Role | Output |
99
  |---|---|---|
 
101
  | `image_search(query)` | Visual references; each result is given a unique `IMG_###` id | Image list with local paths |
102
  | `query_knowledge(skill_name)` | Internal generation knowledge: `spatial_layout`, `text_rendering`, `quantity_counting`, `attribute_binding`, `anatomy_body_coherence`, `physical_material_consistency`, `creative_drawing`, `aesthetic_drawing` | Skill markdown |
103
 
104
+ ---
105
 
106
+ ## 🖼️ Visual Demos
107
 
108
+ <p align="center"><img src="assets/visual_comparison.png" alt="Qualitative comparison" width="100%"></p>
109
 
110
+ <p align="center"><sub>Qualitative comparison on representative cases. <span style="color:#D97706">Orange</span> marks external/uncommon knowledge requirements; <span style="color:#2563EB">blue</span> marks internal generation-knowledge requirements.</sub></p>
111
 
112
+ ### 🎨 Gallery: paired with Nano Banana Pro
113
 
114
+ <p align="center"><img src="assets/gallery_nano.jpg" alt="GenEvolve + Nano Banana Pro gallery" width="100%"></p>
115
 
116
+ <p align="center"><sub>The same agent policy with Nano Banana Pro as the downstream renderer. Examples cover spatial layout, text rendering, quantity counting, attribute binding, anatomy/pose, creative transfer, material physics, and aesthetic drawing.</sub></p>
117
 
118
+ ### 🎨 Gallery: paired with Qwen-Image-Edit (open)
119
+
120
+ <p align="center"><img src="assets/gallery_qwen.jpg" alt="GenEvolve + Qwen-Image-Edit gallery" width="100%"></p>
121
+
122
+ <p align="center"><sub>Same trained policy paired with the open-source Qwen-Image-Edit-2511 renderer; consistent quality across both generators reflects generator-transferable orchestration.</sub></p>
123
+
124
+ ---
125
+
126
+ ## 🚀 Quick Start
127
+
128
+ The deployed checkpoint is the **student policy**: it consumes a user prompt and returns a JSON `gen_prompt + reference_images` program through a `<think>/<tool_call>/<answer>` loop. The end-to-end runtime (vLLM/SGLang server + agent loop + tools + Qwen/Nano renderers) lives in the [GitHub repo](https://github.com/Ephemeral182/GenEvolve).
129
 
130
  ```bash
131
  git clone https://github.com/Ephemeral182/GenEvolve.git
 
134
  pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
135
  pip install --no-build-isolation -r requirements.txt && pip install -e .
136
 
137
+ # Serve the policy (TP/DP knobs scale across GPUs)
138
  MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh
139
 
140
+ # End-to-end example (Nano backend)
141
  export SERPER_API_KEY=<your_key> # required for search / image_search
142
  export GOOGLE_API_KEY=<your_key> # only for the Nano Banana Pro backend
143
  python examples/quickstart.py \
 
148
  --output paris.png
149
  ```
150
 
151
+ The agent's final `<answer>` is a JSON object:
152
 
153
  ```json
154
  {
155
  "gen_prompt": "...natural-language prompt that refers to images by 'the first reference image', ...",
156
  "reference_images": [
157
+ {"img_id": "IMG_001", "note": "what to copy from this image"}
 
158
  ]
159
  }
160
  ```
161
 
162
+ `gen_prompt` MUST refer to selected images using ordinal phrases (`"the first reference image"`), never raw `IMG_###` ids or URLs. Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favourite reference-conditioned generator (Qwen-Image-Edit, Nano Banana Pro, ...) to obtain the final image.
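
One way to consume the final answer is sketched below. It is a hedged illustration rather than the repository's code: it assumes the answer is wrapped in literal `<answer>...</answer>` tags, that ordinals follow the listed order of `reference_images`, and that the caller keeps a hypothetical `local_paths` dict (img_id to file path) built from earlier `image_search` results, since the JSON shown above carries only `img_id` and `note`.

```python
import json
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # assumed tag syntax
RAW_REF_RE = re.compile(r"IMG_\d+|https?://")                 # raw ids or URLs


def parse_program(model_output, local_paths):
    """Extract (gen_prompt, reference paths) from the agent's final turn.

    `local_paths` is a hypothetical dict mapping img_id -> file path,
    built by the caller from earlier image_search results.
    """
    match = ANSWER_RE.search(model_output)
    if match is None:
        raise ValueError("no <answer> block found")
    program = json.loads(match.group(1))

    gen_prompt = program["gen_prompt"]
    if RAW_REF_RE.search(gen_prompt):
        raise ValueError("gen_prompt must use ordinal phrases, not raw ids or URLs")

    # Keep the listed order so 'the first reference image' maps to index 0.
    paths = [local_paths[ref["img_id"]] for ref in program["reference_images"]]
    return gen_prompt, paths


output = (
    "<think>done</think>"
    '<answer>{"gen_prompt": "A poster in the style of the first reference image.", '
    '"reference_images": [{"img_id": "IMG_001", "note": "poster style"}]}</answer>'
)
gen_prompt, paths = parse_program(output, {"IMG_001": "refs/IMG_001.jpg"})
```

The returned pair can then be handed to whichever reference-conditioned generator is in use.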
 
 
163
 
164
  ---
165
 
 
175
 
176
  ---
177
 
178
  ## ⚖️ Intended Use, Limits, Bias
179
 
180
  - **Intended use.** Research on tool-using image-generation agents, agentic prompt-program synthesis, and self-distillation from generated outcomes.
181
  - **Out of scope.** The model produces a *prompt + reference list*, not pixels. Final image quality and safety are inherited from the downstream generator you pair it with. Do not use the agent to fabricate likenesses, infringing logos, or misleading factual imagery; apply your own content-safety filter on the generator side.
182
  - **Search dependency.** The agent issues live web/image queries through user-provided tool wrappers. Quality of grounded facts depends on the search backend you plug in.
183
+ - **Bias.** Tool outputs and reference images come from public web search, which carries demographic, cultural, and geographic biases that may be reflected in agent outputs.
184
 
185
  ---
186
 
187
  ## 📑 Citation
188
 
189
  ```bibtex
190
+ @article{chen2026genevolve,
191
+ title = {GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation},
192
+ author = {Chen, Sixiang and Xing, Zhaohu and Ye, Tian and Geng, Xinyu and Lin, Yunlong
193
+ and Lai, Jianyu and He, Xuanhua and Zhai, Fuxiang and Gao, Jialin and Zhu, Lei},
194
+ year = {2026}
 
195
  }
196
  ```