---
pipeline_tag: image-to-image
---

# Meta-CoT: Enhancing Granularity and Generalization in Image Editing
|
|
[Meta-CoT](https://shiyi-zh0408.github.io/projectpages/Meta-CoT/) is a two-level Chain-of-Thought (CoT) decomposition paradigm for image editing. It decomposes editing intentions into a *(task, target, understanding ability)* triplet and further breaks down tasks into five fundamental meta-tasks, enabling strong generalization across 21+ editing operations.
|
|
[**Project Page**](https://shiyi-zh0408.github.io/projectpages/Meta-CoT/) | [**Paper**](https://huggingface.co/papers/2604.24625) | [**Code**](https://github.com/shiyi-zh0408/Meta-CoT)
|
|
## Overview
|
|
Meta-CoT addresses the challenge of understanding granularity and generalization in image editing through:
|
|
- **Triplet Decomposition**: Decomposes any editing intention into a (task, target, required understanding ability) triplet. This helps the model learn specific elements of an operation during training.
- **Meta-task Generalization**: Breaks down complex editing tasks into five fundamental meta-tasks: Addition, Deletion, Replacement, Camera Motion, and Position Change. Training on these meta-tasks allows the model to generalize to unseen, diverse editing scenarios.
- **CoT-Editing Consistency (CEC) Reward**: A VLM-based reward mechanism integrated into a Flow-GRPO framework that ensures the model's output aligns accurately with its reasoning process.
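As a purely illustrative sketch of the triplet idea, an instruction such as "remove the dog from the lawn" might decompose along these lines. The field names and the ability label below are assumptions for illustration, not the exact schema used by Meta-CoT:

```json
{
  "task": "Deletion",
  "target": "the dog on the lawn",
  "understanding_ability": "object grounding"
}
```

Here "Deletion" is one of the five meta-tasks listed above, so any instruction that reduces to it can reuse what the model learned during meta-task training.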
|
|
## Usage
|
|
To use Meta-CoT for single-image editing, first set up the environment as described in the [GitHub repository](https://github.com/shiyi-zh0408/Meta-CoT). You can then run the following inference script:
|
|
```bash
python inference/edit_single.py --image <your-image-path> --instruction "<editing-instruction>"
```

Note that the instruction should be quoted, since editing instructions typically contain spaces.
|
|
### Key Inference Parameters
|
|
| | Parameter | Description | Typical Range | |
| |-----------|-------------|---------------| |
| | `cfg_text_scale` | Text prompt guidance strength | 4.0 - 8.0 | |
| | `cfg_image_scale` | Input image preservation strength | 1.0 - 2.0 | |
| | `num_timesteps` | Total denoising steps | 50 | |
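Putting these together, a full invocation might look like the example below. The long-form flags (`--cfg_text_scale`, `--cfg_image_scale`, `--num_timesteps`) and the example image path are assumptions based on the parameter names above; check `python inference/edit_single.py --help` for the exact spellings supported by the script.

```bash
# Hypothetical example: flag names assumed from the parameter table above
python inference/edit_single.py \
  --image examples/room.png \
  --instruction "add a potted plant next to the sofa" \
  --cfg_text_scale 6.0 \
  --cfg_image_scale 1.5 \
  --num_timesteps 50
```

Raising `cfg_text_scale` makes the edit follow the instruction more aggressively, while raising `cfg_image_scale` preserves more of the input image.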
|
|
## Citation
|
|
If you find Meta-CoT useful in your research, please cite:
|
|
```bibtex
@article{zhang2026metacot,
  title={Meta-CoT: Enhancing Granularity and Generalization in Image Editing},
  author={Zhang, Shiyi and Cheng, Yiji and Hang, Tiankai and Yin, Zijin and He, Runze and Xu, Yu and Dai, Wenxun and Lin, Yunlong and Wang, Chunyu and Lu, Qinglin and Tang, Yansong},
  journal={arXiv preprint arXiv:2604.24625},
  year={2026}
}
```
|
|
## Acknowledgments
|
|
The code for Meta-CoT is built upon [Bagel](https://github.com/ByteDance-Seed/Bagel). We thank the authors for their open-source contributions.