---
pipeline_tag: image-to-image
---

# Meta-CoT: Enhancing Granularity and Generalization in Image Editing
|
|
[Meta-CoT](https://shiyi-zh0408.github.io/projectpages/Meta-CoT/) is a two-level Chain-of-Thought (CoT) decomposition paradigm for image editing. It decomposes editing intentions into a *(task, target, understanding ability)* triplet and further breaks down tasks into five fundamental meta-tasks, enabling strong generalization across 21+ editing operations.
|
|
[**Project Page**](https://shiyi-zh0408.github.io/projectpages/Meta-CoT/) | [**Paper**](https://huggingface.co/papers/2604.24625) | [**Code**](https://github.com/shiyi-zh0408/Meta-CoT)
|
|
## Overview
|
|
Meta-CoT addresses the challenge of understanding granularity and generalization in image editing through:
|
|
- **Triplet Decomposition**: Decomposes any editing intention into a (task, target, required understanding ability) triplet. This helps the model learn specific elements of an operation during training.
- **Meta-task Generalization**: Breaks down complex editing tasks into five fundamental meta-tasks: Addition, Deletion, Replacement, Camera Motion, and Position Change. Training on these meta-tasks allows the model to generalize to unseen, diverse editing scenarios.
- **CoT-Editing Consistency (CEC) Reward**: A VLM-based reward mechanism integrated into a Flow-GRPO framework that ensures the model's output aligns accurately with its reasoning process.
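As a purely illustrative sketch of the triplet idea, an instruction such as "remove the dog from the lawn" might decompose along these lines. The field names and the ability label below are assumptions for illustration, not the exact schema used by Meta-CoT:

```json
{
  "task": "Deletion",
  "target": "the dog on the lawn",
  "understanding_ability": "object grounding"
}
```

Here "Deletion" is one of the five meta-tasks listed above, so any instruction that reduces to it can reuse what the model learned during meta-task training.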
|
|
## Usage
|
|
To use Meta-CoT for single-image editing, first set up the environment as described in the [GitHub repository](https://github.com/shiyi-zh0408/Meta-CoT). You can then run the following inference script:
|
|
```bash
python inference/edit_single.py --image <your-image-path> --instruction "<editing-instruction>"
```

Note that the instruction should be quoted, since editing instructions typically contain spaces.
|
|
### Key Inference Parameters
|
|
| | Parameter | Description | Typical Range | |
| |-----------|-------------|---------------| |
| | `cfg_text_scale` | Text prompt guidance strength | 4.0 - 8.0 | |
| | `cfg_image_scale` | Input image preservation strength | 1.0 - 2.0 | |
| | `num_timesteps` | Total denoising steps | 50 | |
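Putting these together, a full invocation might look like the example below. The long-form flags (`--cfg_text_scale`, `--cfg_image_scale`, `--num_timesteps`) and the example image path are assumptions based on the parameter names above; check `python inference/edit_single.py --help` for the exact spellings supported by the script.

```bash
# Hypothetical example: flag names assumed from the parameter table above
python inference/edit_single.py \
  --image examples/room.png \
  --instruction "add a potted plant next to the sofa" \
  --cfg_text_scale 6.0 \
  --cfg_image_scale 1.5 \
  --num_timesteps 50
```

Raising `cfg_text_scale` makes the edit follow the instruction more aggressively, while raising `cfg_image_scale` preserves more of the input image.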
|
|
## Citation
|
|
If you find Meta-CoT useful in your research, please cite:
|
|
```bibtex
@article{zhang2026metacot,
  title={Meta-CoT: Enhancing Granularity and Generalization in Image Editing},
  author={Zhang, Shiyi and Cheng, Yiji and Hang, Tiankai and Yin, Zijin and He, Runze and Xu, Yu and Dai, Wenxun and Lin, Yunlong and Wang, Chunyu and Lu, Qinglin and Tang, Yansong},
  journal={arXiv preprint arXiv:2604.24625},
  year={2026}
}
```
|
|
## Acknowledgments
|
|
The code for Meta-CoT is built upon [Bagel](https://github.com/ByteDance-Seed/Bagel). We thank the authors for their open-source contributions.