---
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
datasets:
- Uni-Edit/Train-Data
library_name: transformers
pipeline_tag: any-to-any
license: apache-2.0
---

<p align="left">
  <img src="https://github.com/zhengdian1/Uni-Edit/blob/main/assets/logo.jpg?raw=true" alt="Uni-Edit" width="480"/>
</p>


# 🥯 Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning


[**Project Page**](https://zhengdian1.github.io/Uni-Edit-proj/) | [**GitHub Repository**](https://github.com/zhengdian1/Uni-Edit) | [**Paper**](https://arxiv.org/pdf/2605.21487)

## 👀 Intro

<div align="center">
  <img src="https://github.com/zhengdian1/Uni-Edit/blob/main/assets/teaser.webp?raw=true" alt="Uni-Edit Teaser" width="80%">
</div>

We introduce **Uni-Edit**, an intelligent image editing task that serves as the **first general task for Unified Multimodal Model (UMM) tuning**. Conventional mixed multi-task training suffers from inherent task conflicts and requires complex multi-stage pipelines. Uni-Edit avoids both and achieves true mutual reinforcement, **improving image understanding, generation, and editing capabilities simultaneously using only one task, one training stage, and one dataset.**

To overcome the limitations of simplistic existing editing data, we propose the **first automated and scalable data synthesis pipeline** for intelligent editing. By transforming diverse VQA data into complex instructions with embedded questions and nested logic, we build **Uni-Edit-148k**, a dedicated dataset pairing reasoning-intensive instructions with high-quality edited images.

Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves **comprehensive enhancements across all three multimodal capabilities** without massive data mixing, balancing tricks, or auxiliary operations.

## 🎥 Demo

See the demos on our [🌐 Project Page](https://zhengdian1.github.io/Uni-Edit-proj/).

## 🚀 Training and Inference

For detailed instructions on setup, training, inference, evaluation, and data construction, please refer to the [official GitHub repository](https://github.com/zhengdian1/Uni-Edit).

**⚠️ IMPORTANT: Custom Architecture**
Because this is a custom architecture, you **CANNOT** load it directly via `AutoModel.from_pretrained()`. To run the provided inference code, you **MUST** first merge the weight shards in this repository into a single `ema.safetensors` file on your local machine.

Run the Python [merge script](https://github.com/zhengdian1/Uni-Edit/merge.py) from the directory where you downloaded the repository.
*(Note: the merge requires at least 54 GB of free system RAM.)*
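
Before starting a merge that needs this much memory, it can help to check that all shards downloaded completely and to estimate the size of the merged file. The sketch below is not part of the official repository and does not perform the merge. It only reads the header of each `.safetensors` file (an 8-byte little-endian length followed by a JSON table of tensors) using the Python standard library. The function names `read_safetensors_header` and `summarize_shards`, and the assumption that all shards sit in one directory and match `*.safetensors`, are illustrative.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_shards(shard_dir, pattern="*.safetensors"):
    """Count tensors and tensor bytes across shards and list duplicated names.

    Hypothetical helper: the glob pattern is an assumption about how the
    shards are named, not something the official merge script defines.
    """
    seen = set()
    duplicates = []
    total_bytes = 0
    for shard in sorted(Path(shard_dir).glob(pattern)):
        header = read_safetensors_header(shard)
        for name, info in header.items():
            if name == "__metadata__":
                # Optional free-form metadata entry, not a tensor.
                continue
            if name in seen:
                duplicates.append(name)
            seen.add(name)
            begin, end = info["data_offsets"]
            total_bytes += end - begin
    return {"tensors": len(seen), "bytes": total_bytes, "duplicates": duplicates}
```

Calling `summarize_shards("path/to/downloaded/model")` returns a dictionary with the number of distinct tensors, the total tensor payload in bytes, and any tensor names that appear in more than one shard. If a merged `ema.safetensors` already exists in the same directory, it matches the pattern too, so move it aside or narrow `pattern` first.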

## 📐 Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{zheng2026uniedit,
  title   = {Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning},
  author  = {Zheng, Dian and Zhang, Manyuan and Li, Hongyu and Liu, Hongbo and Zou, Kai and Feng, Kaituo and Li, Hongsheng},
  journal = {arXiv preprint arXiv:2605.21487},
  year    = {2026}
}
```