ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Abstract
ECHO is an efficient diffusion-based vision-language model for chest X-ray report generation that achieves faster inference through direct conditional distillation and response-asymmetric diffusion training while maintaining high clinical accuracy.
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
Community
🔥 [New Paper] ECHO: Revolutionizing CXR Report Generation with 8× Speedup via One-step Block Diffusion!
Here is a quick summary of why we built ECHO, how it works, and our groundbreaking results.
📖 1. Background & Motivation
Writing detailed Chest X-ray (CXR) reports is a massive daily burden for radiologists. While current AI models (autoregressive VLMs) can help automate this, they generate text word by word. This sequential process is simply too slow for high-throughput, real-world hospital settings.
Diffusion models offer a great alternative because they can generate all words in parallel. However, if you try to compress this into a single step for maximum speed, the model usually outputs incoherent gibberish. Why? Because it predicts every word independently without looking at the surrounding context (a flaw known as "mean-field bias").
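To make "mean-field bias" concrete, here is a toy sketch of our own (the two-token vocabulary and probabilities are illustrative, not from the paper): if only two two-token reports are valid, a denoiser that predicts each position independently from its marginals will fabricate invalid mixtures a large fraction of the time.

```python
from itertools import product

# Suppose the only two valid "reports" are ("no", "effusion") and
# ("effusion", "absent"), each with probability 0.5. A token-factorized
# denoiser can only model per-position marginals.
valid = {("no", "effusion"), ("effusion", "absent")}

# Per-position marginals implied by the two valid sequences.
pos0 = {"no": 0.5, "effusion": 0.5}
pos1 = {"effusion": 0.5, "absent": 0.5}

# Probability mass a mean-field (independent-per-token) sampler
# assigns to *invalid* sequences.
invalid_mass = sum(
    pos0[a] * pos1[b]
    for a, b in product(pos0, pos1)
    if (a, b) not in valid
)
print(invalid_mass)  # 0.5 -> half of all samples are incoherent
```

Even with perfect marginals, half the probability mass lands on sequences that never occur in the data, which is exactly the gibberish failure mode described above.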
So, the million-dollar question is: Can we achieve lightning-fast one-step generation without turning medical reports into gibberish? ECHO is our answer.
💡 2. Core Methodologies: How ECHO Works
To build ECHO, we designed a highly efficient three-stage training pipeline, introducing several novel techniques to solve the speed and hallucination issues:
- Stage I: Data Normalization & Pre-training
In routine clinical practice, radiologists often omit normal findings. This implicit bias causes VLMs to hallucinate. We introduced a rigorous CXR data normalization paradigm that explicitly annotates both positive findings and negative assertions, providing unambiguous supervision.
- Stage II: Response-Asymmetric Diffusion (RAD) Adaptation
Converting a traditional model to a diffusion model usually requires duplicating massive amounts of vision tokens (~2,870 for CXRs), which is computationally prohibitive. Our RAD strategy duplicates only the text response portion. This design reduces theoretical training FLOPs by 72.3% and speeds up training by 3.61×!
- Stage III: Direct Conditional Distillation (DCD)
To achieve stable one-step inference, we propose DCD. Instead of training the model to predict words independently, DCD extracts "unfactorized" supervision from a multi-step teacher model. By doing this, we force the model to learn the context and dependencies between words in a single forward pass, eliminating gibberish output.
- Inference Optimization: Fused Block KV Cache
We merge the KV cache update of the previous text block into the current block's generation step. This introduces zero extra FLOPs while halving the total number of forward passes, maximizing real-world speed.
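As a rough sanity check on the RAD savings: the ~2,870 vision-token count comes from the post, but the response length (~157 tokens) and the quadratic, attention-dominated cost model below are our own assumptions, chosen purely for illustration; the paper's actual FLOPs accounting may differ.

```python
# Back-of-envelope comparison of duplicating the full sequence
# (vision + response) versus duplicating only the response (RAD).
V, L = 2870, 157  # vision tokens (from the post), response tokens (assumed)

# Assume cost scales quadratically with sequence length (attention-dominated).
full_cost = (2 * (V + L)) ** 2   # every token appears twice
rad_cost = (V + 2 * L) ** 2      # only the response block is duplicated

reduction = 1 - rad_cost / full_cost
print(f"{reduction:.1%}")  # ≈ 72.3% under these assumptions
```

Under these illustrative assumptions the reduction lands near the reported 72.3%; the key point is that keeping a single copy of the ~2,870 vision tokens dominates the savings.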
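The intuition behind DCD's "unfactorized" supervision can be sketched with a toy sequential teacher (the joint distribution, helper names, and two-step schedule here are our assumptions, not the paper's actual model): because the teacher reveals each token conditioned on what it has already revealed, every rollout is a jointly consistent sample that the one-step student can be trained against directly.

```python
import random

random.seed(0)

# Toy joint distribution over valid two-token "reports".
JOINT = {("no", "effusion"): 0.5, ("effusion", "absent"): 0.5}

def teacher_rollout():
    """Reveal position 0 from its marginal, then position 1 conditionally."""
    seqs, probs = zip(*JOINT.items())
    first = random.choices([s[0] for s in seqs], weights=probs)[0]
    # Conditional step: restrict to sequences consistent with what's revealed.
    consistent = {s: p for s, p in JOINT.items() if s[0] == first}
    second = random.choices(
        [s[1] for s in consistent], weights=list(consistent.values())
    )[0]
    return (first, second)

# Every teacher rollout is a valid joint sample, so it encodes the token
# dependencies that a mean-field target would average away.
samples = [teacher_rollout() for _ in range(1000)]
valid_rate = sum(s in JOINT for s in samples) / len(samples)
print(valid_rate)  # 1.0: all rollouts respect token dependencies
```

Distilling from such on-policy rollouts is what lets the student produce coherent text in a single forward pass, rather than sampling each token from an independent marginal.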
📈 3. Groundbreaking Results
We comprehensively evaluated ECHO against general proprietary models (Gemini-3-Pro, Qwen3-Max), top-tier medical models (Lingshu, MedGemma-27B), and existing diffusion baselines. The results speak for themselves:
- Unmatched Clinical Accuracy: ECHO shatters the performance ceiling of SOTA autoregressive methods, improving clinical accuracy metrics like RaTEScore by 64.33% and SemScore by 60.58%.
- Extreme Efficiency: ECHO achieves an 8× inference speedup compared to traditional word-by-word decoding, with virtually zero compromise in clinical fidelity.
- Robustness against Hallucinations: Thanks to our normalized training corpus and specific token supervision, ECHO significantly reduces false-positive fabrications and infinite repetition loops.
ECHO proves that high clinical accuracy and extreme parallel decoding efficiency are not mutually exclusive in medical AI!
🔗 Links & Resources
We are committed to open science and will be releasing our models and code to the community:
- 📄 Paper: https://arxiv.org/abs/2604.09450
- 🌐 Website: https://echo-midea-airc.github.io/
- 🤗 Model Collection (open soon): https://huggingface.co/datasets/AI4Manufacturing/forge
- 💻 Code (open soon): https://github.com/clf28/ECHO
Feel free to ask any questions or share your thoughts in the thread below. We'd love to discuss VLM acceleration, diffusion models, or medical AI with you all! 🙌🩺