File size: 5,595 Bytes
99d7503
 
 
 
2c90404
 
 
 
 
 
 
 
 
 
 
 
 
 
ed62caa
482d875
2c90404
 
fac13ec
2c90404
 
ed62caa
2c90404
 
 
 
 
15c68a7
2c90404
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94ff4d3
2c90404
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8b57aea
 
 
 
 
 
2c90404
94ff4d3
 
 
 
 
 
99d7503
94ff4d3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
---
license: other
license_name: flux-non-commercial-license
license_link: LICENSE
language:
  - en
base_model:
  - FLUX.2-klein-base-9B
pipeline_tag: image-to-image
tags:
  - image-generation
  - image-editing
  - flux
  - diffusion-single-file
library_name: diffusers
---

# ETCHR-FLUX.2-klein-9B

  πŸ“–<a href="https://arxiv.org/abs/2605.23897">Paper</a>
  | 🏠<a href="https://github.com/InternLM/ETCHR">Homepage</a >
  | πŸ€—<a href="https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B">ETCHR-FLUX.2-klein-9B Model</a >
  | πŸ€—<a href="https://huggingface.co/datasets/BeichenZhang/ETCHR-SFT-400K">ETCHR SFT-400K Dataset</a >
  | πŸ€—<a href="https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K">ETCHR GRPO-10K Dataset</a >
  | πŸ€—<a href="https://huggingface.co/datasets/internlm/DL3DV-2k">DL3DV-2K Benchmark</a >

ETCHR-FLUX.2-klein-9B is a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.


## πŸ“’ News
- πŸš€ [2026/05/22] We have released the training and evaluation code of ETCHR.
- πŸš€ [2026/05/21] We have released the [ETCHR-FLUX.2-klein-9B Model](https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B), [ETCHR-SFT-400K Dataset](https://huggingface.co/datasets/BeichenZhang/ETCHR-SFT-400K) and [ETCHR GRPO-10K Dataset](https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K).

  
## 🌈 Overview
We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor built on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B) designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).
By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.

</p>
<p style="text-align: center;"> 
  <img src="assets/overview.png" alt="Teaser" width="100%"> 
</p>



## πŸ’‘ Highlights
- πŸ”₯ **Decoupled & Plug-and-Play:** ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves.
- πŸ”₯ **Naturally Reflective Pipeline:** Introduces an Edit-Verify-Reason inference mechanism where the understanding model filters out noisy or flawed edits, reverting safely to the original image when verification fails.

## πŸ“Š Results
We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy:
<p style="text-align: center;"> 
  <img src="assets/result.png" alt="Pipeline" width="100%"> 
</p>


## πŸ› οΈ Evaluation
Prepare your environment:
```bash
git clone https://github.com/InternLM/ETCHR.git
conda create -n ETCHR python==3.11
conda activate ETCHR
cd RL/Pref-GRPO
bash env_setup.sh fastvideo
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
```

We Provide an example code running ETCHR on [DL3DV-2K Benchmark](https://huggingface.co/datasets/internlm/DL3DV-2k) in [Evaluation/inference_dl3dv.py](https://github.com/InternLM/ETCHR/blob/master/Evaluation/inference_dl3dv.py), you can start the evaluation with the following two steps:

**Step 1:** start a VLLM server for an understanding model (eg. Qwen3-VL-8B, Kimi K2.5, ...).
```bash
cd Evaluation
bash launch_vllm.sh
```

**Step 2:** Run ETCHR atop any understanding model
```bash
python inference_dl3dv.py
```

## Cases
ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.

<p style="text-align: center;"> 
  <img src="assets/case-3D.png" alt="case3D" width="100%"> 
</p>
<p style="text-align: center;"> 
  <img src="assets/case-jigsaw.png" alt="casejigsaw" width="100%"> 
</p>
<p style="text-align: center;"> 
  <img src="assets/case-maze.png" alt="casejigsaw" width="100%"> 
</p>
<p style="text-align: center;"> 
  <img src="assets/case-chart.png" alt="casejigsaw" width="100%"> 
</p>


## πŸ“„ License
Our work is based on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), so please follow [FLUX Non-Commercial License](https://github.com/black-forest-labs/flux2/blob/main/model_licenses/LICENSE-FLUX-NON-COMMERICAL).

## βœ’οΈCitation
If you find this project useful, please kindly cite:
```
@article{zhang2026etchr,
  title={ETCHR: Editing To Clarify and Harness Reasoning},
  author={Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin},
  journal={arXiv preprint arXiv:2605.23897},
  year={2026}
}
```
## ❀️ Acknowledgement
The base model is [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), a powerful image-to-image model.

The work is built upon <a href="https://github.com/modelscope/DiffSynth-Studio">DiffSynth-Studio</a > and <a href="https://github.com/CodeGoat24/Pref-GRPO">Pref-GRPO</a >, two excellent codebases for Diffusion models training!


---