ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

2026-04-10 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors developed ECHO, a diffusion-based vision-language model that generates chest X-ray reports faster than autoregressive baselines while improving report quality. Autoregressive models decode text one token at a time, which is slow; diffusion-based models generate multiple tokens in parallel but usually need many denoising steps. ECHO compresses those steps into a single step per block, using a new distillation method that keeps the text coherent by capturing dependencies between tokens. The authors report improvements of 64.33% in RaTE and 60.58% in SemScore over state-of-the-art autoregressive methods, along with an 8× inference speedup, without compromising clinical accuracy.

Chest X-ray report generation, Vision-language models, Autoregressive models, Diffusion models, Token decoding, Direct Conditional Distillation, Mean-field bias, Response-Asymmetric Diffusion, Inference latency, Radiology AI

Authors

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
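The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or code: the two-token phrases, their probabilities, and the function names are all hypothetical, chosen only to show what happens when a joint distribution over tokens is replaced by the product of its per-token marginals, as a token-factorized denoiser would do in a single step.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint distribution over a two-token phrase.
# Only two coherent phrases have non-zero probability.
joint = {
    ("pleural", "effusion"): 0.5,
    ("cardiac", "enlargement"): 0.5,
}

def marginals(joint_dist):
    """Per-position token marginals of a joint distribution over sequences."""
    length = len(next(iter(joint_dist)))
    per_position = [defaultdict(float) for _ in range(length)]
    for sequence, prob in joint_dist.items():
        for position, token in enumerate(sequence):
            per_position[position][token] += prob
    return per_position

def factorized(joint_dist):
    """Mean-field approximation: product of the per-position marginals."""
    per_position = marginals(joint_dist)
    approx = {}
    for sequence in product(*[m.keys() for m in per_position]):
        prob = 1.0
        for position, token in enumerate(sequence):
            prob *= per_position[position][token]
        approx[sequence] = prob
    return approx

approx = factorized(joint)

# Probability mass the factorized model assigns to phrases that the
# true joint distribution never produces (e.g. "pleural enlargement").
incoherent_mass = sum(
    prob for sequence, prob in approx.items() if joint.get(sequence, 0.0) == 0.0
)
print(approx)
print("mass on incoherent phrases:", incoherent_mass)
```

In this toy setting the factorized approximation spreads probability evenly over all four token combinations, so half of its mass falls on phrases the joint distribution never produces. Sampling every token independently in one step would therefore mix fragments of different valid phrases, which is the kind of coherence loss that supervision encoding joint token dependencies is meant to avoid.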
