---
license: other
license_name: flux-non-commercial-license
license_link: LICENSE
language:
- en
base_model:
- FLUX.2-klein-base-9B
pipeline_tag: image-to-image
tags:
- image-generation
- image-editing
- flux
- diffusion-single-file
library_name: diffusers
---

# ETCHR-FLUX.2-klein-9B

📖 Paper | 🏠 Homepage | 🤗 ETCHR-FLUX.2-klein-9B Model | 🤗 ETCHR SFT-400K Dataset | 🤗 ETCHR GRPO-10K Dataset | 🤗 DL3DV-2K Benchmark

ETCHR-FLUX.2-klein-9B is a question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR addresses a critical bottleneck: a purely textual chain of thought fails at fine-grained focus and complex spatial transformations.

## 📢 News

- 🚀 [2026/05/22] We have released the training and evaluation code of ETCHR.
- 🚀 [2026/05/21] We have released the [ETCHR-FLUX.2-klein-9B Model](https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B), the [ETCHR-SFT-400K Dataset](https://huggingface.co/datasets/BeichenZhang/ETCHR-SFT-400K), and the [ETCHR GRPO-10K Dataset](https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K).

## 🌈 Overview

We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor built on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B) and designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs). By decoupling the specialized image editor from the downstream understanding model, ETCHR addresses a critical bottleneck: a purely textual chain of thought fails at fine-grained focus and complex spatial transformations.

[Figure: Teaser]

## 💡 Highlights

- 🔥 **Decoupled & Plug-and-Play:** ETCHR functions as a separate module, so it can assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without any task-specific fine-tuning of the understanding models themselves.
- 🔥 **Naturally Reflective Pipeline:** ETCHR introduces an Edit-Verify-Reason inference mechanism in which the understanding model filters out noisy or flawed edits and falls back to the original image when verification fails.

## 📊 Results

We evaluate ETCHR across five distinct task families: fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy:

[Figure: Pipeline]
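The Edit-Verify-Reason mechanism named in the Highlights can be illustrated with a minimal control-flow sketch. This is not the repository's implementation: `edit_verify_reason`, `ReasoningResult`, and the three callables (`edit_fn`, `verify_fn`, `reason_fn`) are hypothetical names introduced here, and the toy functions stand in for the ETCHR editor and the understanding model. It only shows the fallback behavior described above, where a rejected edit causes the original image to be used.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ReasoningResult:
    """Outcome of one Edit-Verify-Reason pass (hypothetical container)."""
    answer: str
    image_used: Any
    edit_accepted: bool


def edit_verify_reason(
    image: Any,
    question: str,
    edit_fn: Callable[[Any, str], Any],
    verify_fn: Callable[[Any, Any, str], bool],
    reason_fn: Callable[[Any, str], str],
) -> ReasoningResult:
    # Edit: the editor produces a question-conditioned edit of the image.
    edited = edit_fn(image, question)
    # Verify: the understanding model decides whether the edit is usable.
    accepted = verify_fn(image, edited, question)
    # Fall back to the original image when verification fails.
    chosen = edited if accepted else image
    # Reason: the understanding model answers using the chosen image.
    answer = reason_fn(chosen, question)
    return ReasoningResult(answer=answer, image_used=chosen, edit_accepted=accepted)


# Toy stand-ins (assumptions for illustration only; images are plain strings).
def toy_edit(image, question):
    return f"{image}+zoom"


def accept_all(original, edited, question):
    return True


def reject_all(original, edited, question):
    return False


def toy_reason(image, question):
    return f"answer from {image}"


ok = edit_verify_reason("img", "What is in the corner?", toy_edit, accept_all, toy_reason)
fallback = edit_verify_reason("img", "What is in the corner?", toy_edit, reject_all, toy_reason)
print(ok.image_used, ok.edit_accepted)              # img+zoom True
print(fallback.image_used, fallback.edit_accepted)  # img False
```

Passing the three stages as callables keeps the editor separate from the understanding model, which mirrors the decoupled design described above: the same loop can wrap any downstream MLLM without changing the editor.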

## 🛠️ Evaluation

Prepare your environment:

```bash
git clone https://github.com/InternLM/ETCHR.git
conda create -n ETCHR python==3.11
conda activate ETCHR
# Enter the cloned repository before running the setup script
cd ETCHR
cd RL/Pref-GRPO
bash env_setup.sh fastvideo
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
```

We provide an example script that runs ETCHR on the [DL3DV-2K Benchmark](https://huggingface.co/datasets/internlm/DL3DV-2k) in [Evaluation/inference_dl3dv.py](https://github.com/InternLM/ETCHR/blob/master/Evaluation/inference_dl3dv.py). You can start the evaluation with the following two steps:

**Step 1:** Start a vLLM server for an understanding model (e.g., Qwen3-VL-8B, Kimi K2.5, ...).

```bash
# Run from the repository root
cd Evaluation
bash launch_vllm.sh
```

**Step 2:** Run ETCHR atop any understanding model.

```bash
python inference_dl3dv.py
```

## Cases

ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.

[Figure: case3D]

[Figure: casejigsaw]

[Figure: casejigsaw]

[Figure: casejigsaw]

## 📄 License

Our work is based on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), so please follow the [FLUX Non-Commercial License](https://github.com/black-forest-labs/flux2/blob/main/model_licenses/LICENSE-FLUX-NON-COMMERICAL).

## ✒️ Citation

If you find this project useful, please kindly cite:

```
@article{zhang2026etchr,
  title={ETCHR: Editing To Clarify and Harness Reasoning},
  author={Beichen Zhang and Yuhong Liu and Jinsong Li and Yuhang Zang and Jiaqi Wang and Dahua Lin},
  journal={arXiv preprint arXiv:2605.23897},
  year={2026}
}
```

## ❤️ Acknowledgement

The base model is [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), a powerful image-to-image model. This work is built upon DiffSynth-Studio and Pref-GRPO, two excellent codebases for training diffusion models.

---