README: simplify to demo-focused card; remove venue mention, training recipe, acknowledgements, internal SDL schema
README.md
CHANGED
|
@@ -44,15 +44,10 @@ datasets:
|
|
| 44 |
<img alt="vllm" src="https://img.shields.io/badge/vLLM-0.11-30A14E">
|
| 45 |
<img alt="cuda" src="https://img.shields.io/badge/CUDA-12.x-76B900?logo=nvidia&logoColor=white">
|
| 46 |
<img alt="license" src="https://img.shields.io/badge/license-Apache%202.0-green">
|
| 47 |
-
<img alt="status" src="https://img.shields.io/badge/status-active-brightgreen">
|
| 48 |
</p>
|
| 49 |
|
| 50 |
</div>
|
| 51 |
|
| 52 |
-
> **GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation**
|
| 53 |
-
> Sixiang Chen, Zhaohu Xing, Tian Ye, Xinyu Geng, Yunlong Lin, Jianyu Lai, Xuanhua He, Fuxiang Zhai, Jialin Gao, Lei Zhu
|
| 54 |
-
> *Submitted to NeurIPS 2026*
|
| 55 |
-
|
| 56 |
This repository hosts the **GenEvolve agent policy**, a Qwen3-VL-8B-Instruct backbone fine-tuned and self-evolved into a tool-orchestrated image-generation agent. Given a user request, the agent issues web/image searches, retrieves visual references, activates internal generation knowledge, and emits an executable **prompt-reference program** `z = (gen_prompt, reference_images)` that drives any reference-conditioned downstream generator (Qwen-Image-Edit, Nano Banana Pro, ...).
|
| 57 |
|
| 58 |
<div align="center">
|
|
@@ -64,18 +59,15 @@ This repository hosts the **GenEvolve agent policy**, a Qwen3-VL-8B-Instruct
|
|
| 64 |
|
| 65 |
---
|
| 66 |
|
| 67 |
-
## ✨
|
| 68 |
|
| 69 |
- **Tool-orchestrated trajectories.** The agent calls `search`, `image_search`, and `query_knowledge` (8 callable generation skills) before producing a final program `z = (gen_prompt, reference_images)`.
|
| 70 |
-
- **Self-evolution
|
| 71 |
-
- **Generator-transferable.** The same trained policy
|
| 72 |
-
- **Strong external generalization.** Achieves **0.82** WiScore on the WISE knowledge-intensive benchmark, beating GPT-4o (0.80) and all agentic baselines.
|
| 73 |
-
|
| 74 |
-
---
|
| 75 |
|
| 76 |
## Headline Results
|
| 77 |
|
| 78 |
-
### GenEvolve-Bench (KScore
|
| 79 |
|
| 80 |
| Method | Generator | KScore | Knowledge-Anch. | Quality-Anch. |
|
| 81 |
|---|---|---:|---:|---:|
|
|
@@ -95,23 +87,13 @@ This repository hosts the **GenEvolve agent policy**, a Qwen3-VL-8B-Instruct
|
|
| 95 |
| Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | **0.85** | 0.68 | 0.78 |
|
| 96 |
| **GenEvolve + Qwen-Image-Edit** | **0.84** | 0.74 | 0.87 | **0.83** | 0.81 | **0.83** | **0.82** |
|
| 97 |
|
| 98 |
-
<div align="center">
|
| 99 |
-
<img src="assets/visual_comparison.png" alt="Visual comparison vs strong baselines" width="100%">
|
| 100 |
-
|
| 101 |
-
<p><em>Visual comparison on representative GenEvolve-Bench cases; <span style="color:#ea580c">orange</span> marks external/uncommon knowledge; <span style="color:#1f6feb">blue</span> marks internal generation-knowledge requirements.</em></p>
|
| 102 |
-
</div>
|
| 103 |
-
|
| 104 |
---
|
| 105 |
|
| 106 |
## 🧠 Method Overview
|
| 107 |
|
| 108 |
<p align="center"><img src="assets/overview.png" alt="GenEvolve method overview" width="92%"></p>
|
| 109 |
|
| 110 |
-
For a user request
|
| 111 |
-
|
| 112 |
-
$$\tau = (a_1, o_1, \ldots, a_T, o_T, z), \qquad z = (g, R),$$
|
| 113 |
-
|
| 114 |
-
where each $a_t$ is one of three actions and $o_t$ is the corresponding observation. The downstream generator renders $\hat{y} = G(g, R)$.
|
| 115 |
|
| 116 |
| Tool | Role | Output |
|
| 117 |
|---|---|---|
|
|
@@ -119,31 +101,31 @@ where each $a_t$ is one of three actions and $o_t$ is the corresponding observat
|
|
| 119 |
| `image_search(query)` | Visual references; each result is given a unique `IMG_###` id | Image list with local paths |
|
| 120 |
| `query_knowledge(skill_name)` | Internal generation knowledge: `spatial_layout`, `text_rendering`, `quantity_counting`, `attribute_binding`, `anatomy_body_coherence`, `physical_material_consistency`, `creative_drawing`, `aesthetic_drawing` | Skill markdown |
|
| 121 |
|
| 122 |
-
|
| 123 |
|
| 124 |
-
|
| 125 |
-
retrieval_key: { trigger, source_prompt_summary }
|
| 126 |
-
decision_guidance:
|
| 127 |
-
decision_focus
|
| 128 |
-
recommended_tool_plan (1–4 imperative bullets)
|
| 129 |
-
search_query_guidance (1–3 bullets)
|
| 130 |
-
skill_routing_guidance (1–4 bullets)
|
| 131 |
-
reference_selection_guidance (1–3 bullets)
|
| 132 |
-
prompt_program_guidance (1–3 bullets)
|
| 133 |
-
failure_guards (1–3 bullets)
|
| 134 |
-
```
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
-
-
|
| 139 |
|
| 140 |
-
##
|
| 141 |
|
| 142 |
-
|
| 143 |
|
| 144 |
-
|
| 145 |
|
| 146 |
-
|
| 147 |
|
| 148 |
```bash
|
| 149 |
git clone https://github.com/Ephemeral182/GenEvolve.git
|
|
@@ -152,10 +134,10 @@ conda create -n genevolve python=3.11 -y && conda activate genevolve
|
|
| 152 |
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
|
| 153 |
pip install --no-build-isolation -r requirements.txt && pip install -e .
|
| 154 |
|
| 155 |
-
#
|
| 156 |
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh
|
| 157 |
|
| 158 |
-
#
|
| 159 |
export SERPER_API_KEY=<your_key> # required for search / image_search
|
| 160 |
export GOOGLE_API_KEY=<your_key> # only for the Nano Banana Pro backend
|
| 161 |
python examples/quickstart.py \
|
|
@@ -166,42 +148,18 @@ python examples/quickstart.py \
|
|
| 166 |
--output paris.png
|
| 167 |
```
|
| 168 |
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
```python
|
| 172 |
-
from transformers import AutoModelForCausalLM, AutoProcessor
|
| 173 |
-
import torch
|
| 174 |
-
|
| 175 |
-
repo = "MeiGen-AI/GenEvolve"
|
| 176 |
-
model = AutoModelForCausalLM.from_pretrained(
|
| 177 |
-
repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
|
| 178 |
-
)
|
| 179 |
-
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
|
| 180 |
-
|
| 181 |
-
messages = [
|
| 182 |
-
{"role": "system", "content": SYSTEM_PROMPT}, # see GitHub repo
|
| 183 |
-
{"role": "user", "content": "A vintage diner sign that says 'BLUE SKY DINER' in red neon."},
|
| 184 |
-
]
|
| 185 |
-
prompt_ids = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
| 186 |
-
out = model.generate(prompt_ids, max_new_tokens=4096, temperature=0.7, top_p=0.95)
|
| 187 |
-
print(processor.decode(out[0], skip_special_tokens=True))
|
| 188 |
-
```
|
| 189 |
-
|
| 190 |
-
### Final-answer JSON
|
| 191 |
|
| 192 |
```json
|
| 193 |
{
|
| 194 |
"gen_prompt": "...natural-language prompt that refers to images by 'the first reference image', ...",
|
| 195 |
"reference_images": [
|
| 196 |
-
{"img_id": "IMG_001", "note": "what to copy from this image"}
|
| 197 |
-
{"img_id": "IMG_004", "note": "what to copy from this image"}
|
| 198 |
]
|
| 199 |
}
|
| 200 |
```
|
| 201 |
|
| 202 |
-
`gen_prompt` MUST refer to selected images using ordinal phrases (`"the first reference image"`), never raw `IMG_###` ids or URLs. `reference_images`
|
| 203 |
-
|
| 204 |
-
Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favourite reference-conditioned generator (Qwen-Image-Edit, Nano Banana Pro, ...) to obtain the final image.
|
| 205 |
|
| 206 |
---
|
| 207 |
|
|
@@ -217,45 +175,22 @@ Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favouri
|
|
| 217 |
|
| 218 |
---
|
| 219 |
|
| 220 |
-
## 🧾 Training Recipe
|
| 221 |
-
|
| 222 |
-
| Stage | Recipe |
|
| 223 |
-
|---|---|
|
| 224 |
-
| **SFT cold start** | LLaMA-Factory, 2 epochs, 16 GPUs, micro-bsz=2, lr=1e-5 (cosine, warmup 0.02), bf16 + FlashAttention-2, ZeRO-3, vision tower frozen. |
|
| 225 |
-
| **Self-evolution** | rLLM/verl, GRPO + experience-conditioned SDL on 8 prompts × 6 rollouts/step, lr=1e-6, ε_ℓ=0.20, ε_h=0.28, 5 epochs over the RL split. |
|
| 226 |
-
| **Reward** | KScore image judge (Faithfulness 0.1 / Visual 0.4 / Text 0.4 / Aesthetics 0.1, Gemini 3.1 Pro Preview) + program-sufficiency text judge, weighted 0.5 / 0.5. |
|
| 227 |
-
| **SDL** | λ_SDL = 2.0, decision-only mask (`<tool_call>`/`<answer>`), top-10% logp-delta filter (`SDL_TOP_K_FRAC=0.1`), IS-cap ρ_max = 2, per-token clip disabled, `seq-mean-token-sum` aggregation. |
|
| 228 |
-
| **Visual experience memory** | 1 bundle / comparison (decision guide); cosine retrieval gate ≥ 0.84; buffer cap 500; Qwen3-Embedding-0.6B keys; teacher-only (no inference-time memory). |
|
| 229 |
-
|
| 230 |
-
Full hyper-parameters and ablations are in the appendix tables of the paper.
|
| 231 |
-
|
| 232 |
-
---
|
| 233 |
-
|
| 234 |
## ⚖️ Intended Use, Limits, Bias
|
| 235 |
|
| 236 |
- **Intended use.** Research on tool-using image-generation agents, agentic prompt-program synthesis, and self-distillation from generated outcomes.
|
| 237 |
- **Out of scope.** The model produces a *prompt + reference list*, not pixels. Final image quality and safety are inherited from the downstream generator you pair it with. Do not use the agent to fabricate likenesses, infringing logos, or misleading factual imagery; apply your own content-safety filter on the generator side.
|
| 238 |
- **Search dependency.** The agent issues live web/image queries through user-provided tool wrappers. Quality of grounded facts depends on the search backend you plug in.
|
| 239 |
-
- **Bias.** Tool outputs and reference images come from public web search, which carries demographic, cultural, and geographic biases
|
| 240 |
|
| 241 |
---
|
| 242 |
|
| 243 |
## Citation
|
| 244 |
|
| 245 |
```bibtex
|
| 246 |
-
@
|
| 247 |
-
title
|
| 248 |
-
author
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
year = {2026}
|
| 252 |
}
|
| 253 |
```
|
| 254 |
-
|
| 255 |
-
---
|
| 256 |
-
|
| 257 |
-
## 🤝 Acknowledgements
|
| 258 |
-
|
| 259 |
-
We thank the Qwen, Gemini, FLUX, Z-Image, and BAGEL teams for the underlying generators we evaluate against, and the Skill-SD / Gen-Searcher / KnowGen / WISE authors for the open recipes and benchmarks our work builds on.
|
| 260 |
-
|
| 261 |
-
For questions or collaboration, please reach out to [Sixiang Chen](mailto:ephemeral182@gmail.com) or open an issue on the [GitHub repo](https://github.com/Ephemeral182/GenEvolve/issues).
|
|
|
|
| 44 |
<img alt="vllm" src="https://img.shields.io/badge/vLLM-0.11-30A14E">
|
| 45 |
<img alt="cuda" src="https://img.shields.io/badge/CUDA-12.x-76B900?logo=nvidia&logoColor=white">
|
| 46 |
<img alt="license" src="https://img.shields.io/badge/license-Apache%202.0-green">
|
| 47 |
</p>
|
| 48 |
|
| 49 |
</div>
|
| 50 |
|
| 51 |
This repository hosts the **GenEvolve agent policy**, a Qwen3-VL-8B-Instruct backbone fine-tuned and self-evolved into a tool-orchestrated image-generation agent. Given a user request, the agent issues web/image searches, retrieves visual references, activates internal generation knowledge, and emits an executable **prompt-reference program** `z = (gen_prompt, reference_images)` that drives any reference-conditioned downstream generator (Qwen-Image-Edit, Nano Banana Pro, ...).
|
| 52 |
|
| 53 |
<div align="center">
|
|
|
|
| 59 |
|
| 60 |
---
|
| 61 |
|
| 62 |
+
## ✨ Highlights
|
| 63 |
|
| 64 |
- **Tool-orchestrated trajectories.** The agent calls `search`, `image_search`, and `query_knowledge` (8 callable generation skills) before producing a final program `z = (gen_prompt, reference_images)`.
|
| 65 |
+
- **Self-evolution with Visual Experience Distillation.** Best-vs-worst trajectory pairs are distilled token-level into the deployed student. **No runtime memory at inference.**
|
| 66 |
+
- **Generator-transferable.** The same trained policy works with both an open-source generator (Qwen-Image-Edit-2511) and a strong proprietary generator (Nano Banana Pro).
|
| 67 |
|
| 68 |
## Headline Results
|
| 69 |
|
| 70 |
+
### GenEvolve-Bench (KScore, held-out split)
|
| 71 |
|
| 72 |
| Method | Generator | KScore | Knowledge-Anch. | Quality-Anch. |
|
| 73 |
|---|---|---:|---:|---:|
|
|
|
|
| 87 |
| Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | **0.85** | 0.68 | 0.78 |
|
| 88 |
| **GenEvolve + Qwen-Image-Edit** | **0.84** | 0.74 | 0.87 | **0.83** | 0.81 | **0.83** | **0.82** |
|
| 89 |
|
| 90 |
---
|
| 91 |
|
| 92 |
## 🧠 Method Overview
|
| 93 |
|
| 94 |
<p align="center"><img src="assets/overview.png" alt="GenEvolve method overview" width="92%"></p>
|
| 95 |
|
| 96 |
+
For a user request, the agent samples a multi-turn trajectory of tool calls before emitting the final prompt-reference program. The downstream generator then renders the image.
|
| 97 |
|
| 98 |
| Tool | Role | Output |
|
| 99 |
|---|---|---|
|
|
|
|
| 101 |
| `image_search(query)` | Visual references; each result is given a unique `IMG_###` id | Image list with local paths |
|
| 102 |
| `query_knowledge(skill_name)` | Internal generation knowledge: `spatial_layout`, `text_rendering`, `quantity_counting`, `attribute_binding`, `anatomy_body_coherence`, `physical_material_consistency`, `creative_drawing`, `aesthetic_drawing` | Skill markdown |
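The `query_knowledge` row above names eight skills. A minimal sketch of how a tool wrapper might validate `skill_name` before returning the skill markdown is shown here. The `skill_docs` mapping and the error behaviour are assumptions for illustration; the actual wrapper lives in the GitHub repo and may differ.

```python
# The eight skill names listed in the tool table.
SKILLS = (
    "spatial_layout",
    "text_rendering",
    "quantity_counting",
    "attribute_binding",
    "anatomy_body_coherence",
    "physical_material_consistency",
    "creative_drawing",
    "aesthetic_drawing",
)


def query_knowledge(skill_name, skill_docs):
    """Return the markdown for one skill.

    skill_docs is a hypothetical mapping from skill name to markdown text;
    how the real runtime stores skills is not specified in this card.
    """
    if skill_name not in SKILLS:
        # Reject names outside the documented skill set.
        raise ValueError(f"unknown skill {skill_name!r}; expected one of {SKILLS}")
    return skill_docs[skill_name]
```

Rejecting unknown names early keeps a malformed tool call from silently returning an empty observation to the agent.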
|
| 103 |
|
| 104 |
+
---
|
| 105 |
|
| 106 |
+
## 🖼️ Visual Demos
|
| 107 |
|
| 108 |
+
<p align="center"><img src="assets/visual_comparison.png" alt="Qualitative comparison" width="100%"></p>
|
| 109 |
|
| 110 |
+
<p align="center"><sub>Qualitative comparison on representative cases. <span style="color:#D97706">Orange</span> marks external/uncommon knowledge requirements; <span style="color:#2563EB">blue</span> marks internal generation-knowledge requirements.</sub></p>
|
| 111 |
|
| 112 |
+
### 🎨 Gallery: paired with Nano Banana Pro
|
| 113 |
|
| 114 |
+
<p align="center"><img src="assets/gallery_nano.jpg" alt="GenEvolve + Nano Banana Pro gallery" width="100%"></p>
|
| 115 |
|
| 116 |
+
<p align="center"><sub>The same agent policy with Nano Banana Pro as the downstream renderer. Examples cover spatial layout, text rendering, quantity counting, attribute binding, anatomy/pose, creative transfer, material physics, and aesthetic drawing.</sub></p>
|
| 117 |
|
| 118 |
+
### 🎨 Gallery: paired with Qwen-Image-Edit (open)
|
| 119 |
+
|
| 120 |
+
<p align="center"><img src="assets/gallery_qwen.jpg" alt="GenEvolve + Qwen-Image-Edit gallery" width="100%"></p>
|
| 121 |
+
|
| 122 |
+
<p align="center"><sub>Same trained policy paired with the open-source Qwen-Image-Edit-2511 renderer; consistent quality across both generators reflects generator-transferable orchestration.</sub></p>
|
| 123 |
+
|
| 124 |
+
---
|
| 125 |
+
|
| 126 |
+
## Quick Start
|
| 127 |
+
|
| 128 |
+
The deployed checkpoint is the **student policy**: it consumes a user prompt and returns a JSON `gen_prompt + reference_images` program through a `<think>/<tool_call>/<answer>` loop. The end-to-end runtime (vLLM/SGLang server + agent loop + tools + Qwen/Nano renderers) lives in the [GitHub repo](https://github.com/Ephemeral182/GenEvolve).
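To make the `<think>/<tool_call>/<answer>` loop concrete, here is a minimal sketch of a driver. It is not the repo's runtime. The `generate` callable, the `tools` mapping, the `tool` message role, and the `{"name": ..., "arguments": ...}` payload inside `<tool_call>` are all assumptions for illustration.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_agent(generate, tools, user_prompt, max_turns=8):
    """Drive the loop until the policy emits an <answer> block.

    generate: hypothetical callable, messages -> assistant text.
    tools: hypothetical mapping from tool name to a Python callable.
    """
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})

        answer = ANSWER_RE.search(reply)
        if answer:
            # Final program: {"gen_prompt": ..., "reference_images": [...]}
            return json.loads(answer.group(1))

        call = TOOL_CALL_RE.search(reply)
        if call is None:
            raise RuntimeError("turn contained neither <tool_call> nor <answer>")

        # Assumed payload shape: {"name": "...", "arguments": {...}}
        request = json.loads(call.group(1))
        observation = tools[request["name"]](**request["arguments"])
        messages.append({"role": "tool", "content": json.dumps(observation)})
    raise RuntimeError("no <answer> within max_turns")
```

In practice `generate` would wrap a chat-completions call to the served policy, and `tools` would hold the `search`, `image_search`, and `query_knowledge` wrappers.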
|
| 129 |
|
| 130 |
```bash
|
| 131 |
git clone https://github.com/Ephemeral182/GenEvolve.git
|
|
|
|
| 134 |
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
|
| 135 |
pip install --no-build-isolation -r requirements.txt && pip install -e .
|
| 136 |
|
| 137 |
+
# Serve the policy (TP/DP knobs scale across GPUs)
|
| 138 |
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh
|
| 139 |
|
| 140 |
+
# End-to-end example (Nano backend)
|
| 141 |
export SERPER_API_KEY=<your_key> # required for search / image_search
|
| 142 |
export GOOGLE_API_KEY=<your_key> # only for the Nano Banana Pro backend
|
| 143 |
python examples/quickstart.py \
|
|
|
|
| 148 |
--output paris.png
|
| 149 |
```
|
| 150 |
|
| 151 |
+
The agent's final `<answer>` is a JSON object:
|
| 152 |
|
| 153 |
```json
|
| 154 |
{
|
| 155 |
"gen_prompt": "...natural-language prompt that refers to images by 'the first reference image', ...",
|
| 156 |
"reference_images": [
|
| 157 |
+
{"img_id": "IMG_001", "note": "what to copy from this image"}
|
| 158 |
]
|
| 159 |
}
|
| 160 |
```
|
| 161 |
|
| 162 |
+
`gen_prompt` MUST refer to selected images using ordinal phrases (`"the first reference image"`), never raw `IMG_###` ids or URLs. Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favourite reference-conditioned generator (Qwen-Image-Edit, Nano Banana Pro, ...) to obtain the final image.
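The JSON example shows only `img_id` and `note` per reference, while the hand-off expects a `local_path`. One plausible reading is that the runtime resolves ids against earlier `image_search` results. The sketch below assumes that, using a hypothetical `image_index` mapping from `IMG_###` id to local path, and also checks the ordinal-phrase rule.

```python
import json
import re

IMG_ID_RE = re.compile(r"IMG_\d+")
URL_RE = re.compile(r"https?://")


def parse_program(answer_json, image_index):
    """Turn the final answer JSON into (gen_prompt, [local_path, ...]).

    image_index is a hypothetical dict mapping IMG_### ids to local paths,
    assumed to be collected from image_search observations by the caller.
    """
    program = json.loads(answer_json)
    gen_prompt = program["gen_prompt"]

    # gen_prompt must use ordinal phrases, never raw ids or URLs.
    if IMG_ID_RE.search(gen_prompt) or URL_RE.search(gen_prompt):
        raise ValueError("gen_prompt must not contain IMG_### ids or URLs")

    paths = []
    for ref in program.get("reference_images", []):
        img_id = ref["img_id"]
        if img_id not in image_index:
            raise KeyError(f"{img_id} was never returned by image_search")
        paths.append(image_index[img_id])
    return gen_prompt, paths
```

The list order is preserved, so "the first reference image" in `gen_prompt` corresponds to the first path passed to the generator.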
|
| 163 |
|
| 164 |
---
|
| 165 |
|
|
|
|
| 175 |
|
| 176 |
---
|
| 177 |
|
| 178 |
## ⚖️ Intended Use, Limits, Bias
|
| 179 |
|
| 180 |
- **Intended use.** Research on tool-using image-generation agents, agentic prompt-program synthesis, and self-distillation from generated outcomes.
|
| 181 |
- **Out of scope.** The model produces a *prompt + reference list*, not pixels. Final image quality and safety are inherited from the downstream generator you pair it with. Do not use the agent to fabricate likenesses, infringing logos, or misleading factual imagery; apply your own content-safety filter on the generator side.
|
| 182 |
- **Search dependency.** The agent issues live web/image queries through user-provided tool wrappers. Quality of grounded facts depends on the search backend you plug in.
|
| 183 |
+
- **Bias.** Tool outputs and reference images come from public web search, which carries demographic, cultural, and geographic biases that may be reflected in agent outputs.
|
| 184 |
|
| 185 |
---
|
| 186 |
|
| 187 |
## Citation
|
| 188 |
|
| 189 |
```bibtex
|
| 190 |
+
@article{chen2026genevolve,
|
| 191 |
+
title = {GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation},
|
| 192 |
+
author = {Chen, Sixiang and Xing, Zhaohu and Ye, Tian and Geng, Xinyu and Lin, Yunlong
|
| 193 |
+
and Lai, Jianyu and He, Xuanhua and Zhai, Fuxiang and Gao, Jialin and Zhu, Lei},
|
| 194 |
+
year = {2026}
|
| 195 |
}
|
| 196 |
```