Image-to-Image
Diffusers
Safetensors
Diffusion Single File
English
Flux2KleinPipeline
image-generation
image-editing
flux
Instructions to use internlm/ETCHR-FLUX.2-klein-9B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use internlm/ETCHR-FLUX.2-klein-9B with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline from diffusers.utils import load_image # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("internlm/ETCHR-FLUX.2-klein-9B", dtype=torch.bfloat16, device_map="cuda") prompt = "Turn this cat into a dog" input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png") image = pipe(image=input_image, prompt=prompt).images[0] - Diffusion Single File
How to use internlm/ETCHR-FLUX.2-klein-9B with Diffusion Single File:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
| license: other | |
| license_name: flux-non-commercial-license | |
| license_link: LICENSE | |
| language: | |
| - en | |
| base_model: | |
| - FLUX.2-klein-base-9B | |
| pipeline_tag: image-to-image | |
| tags: | |
| - image-generation | |
| - image-editing | |
| - flux | |
| - diffusion-single-file | |
| library_name: diffusers | |
| # ETCHR-FLUX.2-klein-9B | |
| ๐<a href="https://arxiv.org/abs/2605.23897">Paper</a> | |
| | ๐ <a href="https://github.com/InternLM/ETCHR">Homepage</a > | |
| | ๐ค<a href="https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B">ETCHR-FLUX.2-klein-9B Model</a > | |
| | ๐ค<a href="https://huggingface.co/datasets/BeichenZhang/ETCHR-SFT-400K">ETCHR SFT-400K Dataset</a > | |
| | ๐ค<a href="https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K">ETCHR GRPO-10K Dataset</a > | |
| | ๐ค<a href="https://huggingface.co/datasets/internlm/DL3DV-2k">DL3DV-2K Benchmark</a > | |
| ETCHR-FLUX.2-klein-9B is a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations. | |
| ## ๐ข News | |
| - ๐ [2026/05/22] We have released the training and evaluation code of ETCHR. | |
| - ๐ [2026/05/21] We have released the [ETCHR-FLUX.2-klein-9B Model](https://huggingface.co/internlm/ETCHR-FLUX.2-klein-9B), [ETCHR-SFT-400K Dataset](https://huggingface.co/datasets/BeichenZhang/ETCHR-SFT-400K) and [ETCHR GRPO-10K Dataset](https://huggingface.co/datasets/internlm/ETCHR-GRPO-10K). | |
| ## ๐ Overview | |
| We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor built on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B) designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs). | |
| By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations. | |
| </p> | |
| <p style="text-align: center;"> | |
| <img src="assets/overview.png" alt="Teaser" width="100%"> | |
| </p> | |
| ## ๐ก Highlights | |
| - ๐ฅ **Decoupled & Plug-and-Play:** ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves. | |
| - ๐ฅ **Naturally Reflective Pipeline:** Introduces an Edit-Verify-Reason inference mechanism where the understanding model filters out noisy or flawed edits, reverting safely to the original image when verification fails. | |
| ## ๐ Results | |
| We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy: | |
| <p style="text-align: center;"> | |
| <img src="assets/result.png" alt="Pipeline" width="100%"> | |
| </p> | |
| ## ๐ ๏ธ Evaluation | |
| Prepare your environment: | |
| ```bash | |
| git clone https://github.com/InternLM/ETCHR.git | |
| conda create -n ETCHR python==3.11 | |
| conda activate ETCHR | |
| cd RL/Pref-GRPO | |
| bash env_setup.sh fastvideo | |
| pip install "vllm>=0.11.0" | |
| pip install qwen-vl-utils==0.0.14 | |
| ``` | |
| We Provide an example code running ETCHR on [DL3DV-2K Benchmark](https://huggingface.co/datasets/internlm/DL3DV-2k) in [Evaluation/inference_dl3dv.py](https://github.com/InternLM/ETCHR/blob/master/Evaluation/inference_dl3dv.py), you can start the evaluation with the following two steps: | |
| **Step 1:** start a VLLM server for an understanding model (eg. Qwen3-VL-8B, Kimi K2.5, ...). | |
| ```bash | |
| cd Evaluation | |
| bash launch_vllm.sh | |
| ``` | |
| **Step 2:** Run ETCHR atop any understanding model | |
| ```bash | |
| python inference_dl3dv.py | |
| ``` | |
| ## Cases | |
| ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding. | |
| <p style="text-align: center;"> | |
| <img src="assets/case-3D.png" alt="case3D" width="100%"> | |
| </p> | |
| <p style="text-align: center;"> | |
| <img src="assets/case-jigsaw.png" alt="casejigsaw" width="100%"> | |
| </p> | |
| <p style="text-align: center;"> | |
| <img src="assets/case-maze.png" alt="casejigsaw" width="100%"> | |
| </p> | |
| <p style="text-align: center;"> | |
| <img src="assets/case-chart.png" alt="casejigsaw" width="100%"> | |
| </p> | |
| ## ๐ License | |
| Our work is based on [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), so please follow [FLUX Non-Commercial License](https://github.com/black-forest-labs/flux2/blob/main/model_licenses/LICENSE-FLUX-NON-COMMERICAL). | |
| ## โ๏ธCitation | |
| If you find this project useful, please kindly cite: | |
| ``` | |
| @article{zhang2026etchr, | |
| title={ETCHR: Editing To Clarify and Harness Reasoning}, | |
| author={Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin}, | |
| journal={arXiv preprint arXiv:2605.23897}, | |
| year={2026} | |
| } | |
| ``` | |
| ## โค๏ธ Acknowledgement | |
| The base model is [FLUX.2-klein-base-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B), a powerful image-to-image model. | |
| The work is built upon <a href="https://github.com/modelscope/DiffSynth-Studio">DiffSynth-Studio</a > and <a href="https://github.com/CodeGoat24/Pref-GRPO">Pref-GRPO</a >, two excellent codebases for Diffusion models training! | |
| --- | |