yuhangzang commited on
Commit
2c90404
Β·
verified Β·
1 Parent(s): 6f53247

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +101 -0
README.md CHANGED
@@ -2,4 +2,105 @@
2
  license: other
3
  license_name: flux-non-commercial-license
4
  license_link: LICENSE
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5
  ---
 
2
  license: other
3
  license_name: flux-non-commercial-license
4
  license_link: LICENSE
5
+ language:
6
+ - en
7
+ base_model:
8
+ - FLUX.2-klein-base-9B
9
+ pipeline_tag: image-to-image
10
+ tags:
11
+ - image-generation
12
+ - image-editing
13
+ - flux
14
+ - diffusion-single-file
15
+ library_name: diffusers
16
+ ---
17
+
18
+ # ETCHR-FLUX.2-klein-9B
19
+ <p align="center" style="font-size: 1.2em; margin-top: 0.5em">
20
+ πŸ“–<a href="https://arxiv.org/abs/">Paper</a>
21
+ | 🏠<a href="https://github.com/InternLM/ETCHR">Homepage</a >
22
+ | πŸ€—<a href="https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B">ETCHR-FLUX.2-klein-9B Model</a >
23
+ | πŸ€—<a href="https://huggingface.co/datasets/internlm/ETCHR-SFT-400K">ETCHR SFT-400K Dataset</a >
24
+ | πŸ€—<a href="https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K">ETCHR GRPO-10K Dataset</a >
25
+ | πŸ€—<a href="https://huggingface.co/datasets/internlm/DL3DV-2k">DL3DV-2K Benchmark</a >
26
+ </p >
27
+ ETCHR-FLUX.2-klein-9B is a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.
28
+
29
+
30
+ ## πŸ“’ News
31
+ - πŸš€ [2026/05/22] We have released the training and evaluation code of ETCHR.
32
+ - πŸš€ [2026/05/21] We have released the [ETCHR-FLUX.2-klein-9B Model](https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B), [ETCHR-SFT-400K Dataset](https://huggingface.co/datasets/internlm/ETCHR-SFT-400K) and [ETCHR GRPO-10K Dataset](https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K).
33
+
34
+
35
+ ## 🌈 Overview
36
+ We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor built on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B) designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).
37
+ By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.
38
+
39
+ </p>
40
+ <p style="text-align: center;">
41
+ <img src="assets/overview.png" alt="Teaser" width="100%">
42
+ </p>
43
+
44
+
45
+
46
+ ## πŸ’‘ Highlights
47
+ - πŸ”₯ **Decoupled & Plug-and-Play:** ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves.
48
+ - πŸ”₯ **Naturally Reflective Pipeline:** Introduces an Edit-Verify-Reason inference mechanism where the understanding model filters out noisy or flawed edits, reverting safely to the original image when verification fails.
49
+
50
+ ## πŸ“Š Results
51
+ We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy:
52
+ <p style="text-align: center;">
53
+ <img src="assets/result.png" alt="Pipeline" width="100%">
54
+ </p>
55
+
56
+
57
+ ## πŸ› οΈ Evaluation
58
+ Prepare your environment:
59
+ ```bash
60
+ git clone https://github.com/InternLM/ETCHR.git
61
+ conda create -n ETCHR python==3.11
62
+ conda activate ETCHR
63
+ cd RL/Pref-GRPO
64
+ bash env_setup.sh fastvideo
65
+ pip install "vllm>=0.11.0"
66
+ pip install qwen-vl-utils==0.0.14
67
+ ```
68
+
69
+ We Provide an example code running ETCHR on [DL3DV-2K Benchmark](https://huggingface.co/datasets/internlm/DL3DV-2k) in ```[Evaluation/inference_dl3dv.py](https://github.com/InternLM/ETCHR/blob/master/Evaluation/inference_dl3dv.py)```, you can start the evaluation with the following two steps:
70
+
71
+ **Step 1:** start a VLLM server for an understanding model (eg. Qwen3-VL-8B, Kimi K2.5, ...).
72
+ ```bash
73
+ cd Evaluation
74
+ bash launch_vllm.sh
75
+ ```
76
+
77
+ **Step 2:** Run ETCHR atop any understanding model
78
+ ```bash
79
+ python inference_dl3dv.py
80
+ ```
81
+
82
+ ## Cases
83
+ ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.
84
+
85
+ <p style="text-align: center;">
86
+ <img src="assets/case-3D.png" alt="case3D" width="100%">
87
+ </p>
88
+ <p style="text-align: center;">
89
+ <img src="assets/case-jigsaw.png" alt="casejigsaw" width="100%">
90
+ </p>
91
+ <p style="text-align: center;">
92
+ <img src="assets/case-maze.png" alt="casejigsaw" width="100%">
93
+ </p>
94
+ <p style="text-align: center;">
95
+ <img src="assets/case-chart.png" alt="casejigsaw" width="100%">
96
+ </p>
97
+
98
+
99
+ ## πŸ“„ License
100
+ Our work is based on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), so please follow [FLUX Non-Commercial License](https://github.com/black-forest-labs/flux2/blob/main/model_licenses/LICENSE-FLUX-NON-COMMERICAL).
101
+
102
+ ## βœ’οΈCitation
103
+ If you find this project useful, please kindly cite:
104
+ ```
105
+ ```
106
  ---