Cross-Modal Augmented Transformer for Automated Medical Report Generation

IF 3.7, Zone 3 (Medicine), JCR Q2, ENGINEERING, BIOMEDICAL

IEEE Journal of Translational Engineering in Health and Medicine-Jtehm Pub Date : 2025-01-29 DOI:10.1109/JTEHM.2025.3536441

Yuhao Tang;Ye Yuan;Fei Tao;Minghao Tang

Volume 13, pages 33-48. Journal Article. Full text: https://ieeexplore.ieee.org/document/10857391/

Citations: 0

Abstract

In clinical practice, interpreting medical images and composing diagnostic reports typically involve a significant manual workload. An automated report generation framework that mimics a doctor’s diagnosis therefore better meets the requirements of medical scenarios. Prior investigations often overlook this critical aspect, relying primarily on traditional image captioning frameworks originally designed for general-domain images and sentences. Despite some advances, these methods face two main challenges. First, the strong noise in blurred medical images hinders the model from capturing the lesion region. Second, during report writing, doctors typically rely on terminology for diagnosis, a crucial aspect that has been neglected in prior frameworks. In this paper, we present a novel approach called Cross-modal Augmented Transformer (CAT) for medical report generation. Unlike previous methods that rely on coarse-grained features without human intervention, our method introduces a “locate then generate” pattern, thereby improving the interpretability of the generated reports. During the locate stage, CAT captures crucial representations by pre-aligning significant patches and their corresponding medical terminologies. This pre-alignment helps reduce visual noise by discarding low-ranking content, ensuring that only relevant information is considered in the report generation process. During the generation stage, CAT uses a multi-modality encoder to reinforce the correlation between generated keywords, retrieved terminologies, and regions. Furthermore, CAT employs a dual-stream decoder that dynamically determines whether the predicted word should be influenced by the retrieved terminology or the preceding sentence. Experimental results demonstrate the effectiveness of the proposed method on two datasets.

Clinical impact: This work aims to design an automated framework for explaining medical images to evaluate the health status of individuals, thereby facilitating the broader application of such frameworks in clinical settings.

Clinical and Translational Impact Statement: In our preclinical research, we develop an automated system for generating diagnostic reports. This system mimics manual diagnostic methods by combining fine-grained semantic alignment with dual-stream decoders.
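The "locate then generate" pattern described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how patches are scored, how many are kept, or how the dual-stream decoder weighs its two sources, so everything below is an assumption made for illustration: cosine similarity as the patch-term score, a fixed top-k cutoff (`keep`), and a sigmoid gate mixing two next-word distributions. The function names `locate` and `gated_mix` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def locate(patches, terms, keep):
    """Pair each patch with its best-matching term, then keep the top `keep`.

    Assumed scoring: cosine similarity. Low-ranking patches are discarded,
    which is one way to read "discarding low-ranking content".
    Returns (patch_index, term_index) pairs, highest score first.
    """
    scored = []
    for i, patch in enumerate(patches):
        sims = [cosine(patch, term) for term in terms]
        best = max(range(len(terms)), key=lambda j: sims[j])
        scored.append((sims[best], i, best))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(i, j) for _, i, j in scored[:keep]]


def gated_mix(p_term, p_context, gate_logit):
    """Blend two next-word distributions with a sigmoid gate (assumed form).

    A gate near 1 favors the terminology stream; a gate near 0 favors
    the preceding-sentence stream.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * a + (1.0 - g) * b for a, b in zip(p_term, p_context)]


# Toy data: three 2-D patch embeddings and two terminology embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
terms = [[1.0, 0.1], [0.1, 1.0]]
selected = locate(patches, terms, keep=2)

# Toy next-word distributions over a 3-word vocabulary.
mixed = gated_mix([0.7, 0.2, 0.1], [0.1, 0.3, 0.6], gate_logit=0.0)
```

In this toy run the third patch points away from both term embeddings, so it is the one dropped, and a gate logit of 0 weights the two streams equally. In a trained model the gate would presumably be predicted per decoding step rather than fixed.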




Source journal

IEEE Journal of Translational Engineering in Health and Medicine-Jtehm Engineering-Biomedical Engineering

CiteScore: 7.40

Self-citation rate: 2.90%

Review time: 27 weeks

About the journal: The IEEE Journal of Translational Engineering in Health and Medicine is an open access product that bridges the engineering and clinical worlds, focusing on detailed descriptions of advanced technical solutions to a clinical need along with clinical results and healthcare relevance. The journal provides a platform for state-of-the-art technology directions in the interdisciplinary field of biomedical engineering, embracing engineering, life sciences and medicine. A unique aspect of the journal is its ability to foster collaboration between physicians and engineers in presenting broad and compelling real-world technological and engineering solutions that can be implemented to improve the quality of patient care and treatment outcomes, thereby reducing costs and improving efficiency. The journal provides an active forum for clinical research and relevant state-of-the-art technology for members of all the IEEE societies that have an interest in biomedical engineering, and it reaches out directly to physicians and the medical community through the American Medical Association (AMA) and other clinical societies. The scope of the journal includes, but is not limited to, topics on: medical devices, healthcare delivery systems, global healthcare initiatives, and ICT-based services; technological relevance to healthcare cost reduction; technology affecting healthcare management, decision-making, and policy; and advanced technical work that is applied to solving specific clinical needs.