internlm
/

ETCHR-FLUX.2-klein-9B

 license: other
 license_name: flux-non-commercial-license
 license_link: LICENSE
+language:
+  - en
+base_model:
+  - FLUX.2-klein-base-9B
+pipeline_tag: image-to-image
+tags:
+  - image-generation
+  - image-editing
+  - flux
+  - diffusion-single-file
+library_name: diffusers
+---
+# ETCHR-FLUX.2-klein-9B
+  <p align="center" style="font-size: 1.2em; margin-top: 0.5em">
+    📖<a href="https://arxiv.org/abs/">Paper</a>
+  | 🏠<a href="https://github.com/InternLM/ETCHR">Homepage</a >
+  | 🤗<a href="https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B">ETCHR-FLUX.2-klein-9B Model</a >
+  | 🤗<a href="https://huggingface.co/datasets/internlm/ETCHR-SFT-400K">ETCHR SFT-400K Dataset</a >
+  | 🤗<a href="https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K">ETCHR GRPO-10K Dataset</a >
+  | 🤗<a href="https://huggingface.co/datasets/internlm/DL3DV-2k">DL3DV-2K Benchmark</a >
+  </p >
+ETCHR-FLUX.2-klein-9B is a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.
+## 📢 News
+- 🚀 [2026/05/22] We have released the training and evaluation code of ETCHR.
+- 🚀 [2026/05/21] We have released the [ETCHR-FLUX.2-klein-9B Model](https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B), [ETCHR-SFT-400K Dataset](https://huggingface.co/datasets/internlm/ETCHR-SFT-400K) and [ETCHR GRPO-10K Dataset](https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K).
+## 🌈 Overview
+We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor built on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B) designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).
+By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.
+</p>
+<p style="text-align: center;">
+  <img src="assets/overview.png" alt="Teaser" width="100%">
+</p>
+## 💡 Highlights
+- 🔥 **Decoupled & Plug-and-Play:** ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves.
+- 🔥 **Naturally Reflective Pipeline:** Introduces an Edit-Verify-Reason inference mechanism where the understanding model filters out noisy or flawed edits, reverting safely to the original image when verification fails.
+## 📊 Results
+We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy:
+<p style="text-align: center;">
+  <img src="assets/result.png" alt="Pipeline" width="100%">
+</p>
+## 🛠️ Evaluation
+Prepare your environment:
+```bash
+git clone https://github.com/InternLM/ETCHR.git
+conda create -n ETCHR python==3.11
+conda activate ETCHR
+cd RL/Pref-GRPO
+bash env_setup.sh fastvideo
+pip install "vllm>=0.11.0"
+pip install qwen-vl-utils==0.0.14
+```
+We Provide an example code running ETCHR on [DL3DV-2K Benchmark](https://huggingface.co/datasets/internlm/DL3DV-2k) in ```[Evaluation/inference_dl3dv.py](https://github.com/InternLM/ETCHR/blob/master/Evaluation/inference_dl3dv.py)```, you can start the evaluation with the following two steps:
+**Step 1:** start a VLLM server for an understanding model (eg. Qwen3-VL-8B, Kimi K2.5, ...).
+```bash
+cd Evaluation
+bash launch_vllm.sh
+```
+**Step 2:** Run ETCHR atop any understanding model
+```bash
+python inference_dl3dv.py
+```
+## Cases
+ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.
+<p style="text-align: center;">
+  <img src="assets/case-3D.png" alt="case3D" width="100%">
+</p>
+<p style="text-align: center;">
+  <img src="assets/case-jigsaw.png" alt="casejigsaw" width="100%">
+</p>
+<p style="text-align: center;">
+  <img src="assets/case-maze.png" alt="casejigsaw" width="100%">
+</p>
+<p style="text-align: center;">
+  <img src="assets/case-chart.png" alt="casejigsaw" width="100%">
+</p>
+## 📄 License
+Our work is based on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), so please follow [FLUX Non-Commercial License](https://github.com/black-forest-labs/flux2/blob/main/model_licenses/LICENSE-FLUX-NON-COMMERICAL).
+## ✒️Citation
+If you find this project useful, please kindly cite:
+```
+```
 ---