Knowledge-Guided Cross-Modal Alignment and Progressive Fusion for Chest X-Ray Report Generation

IF 8.4 | Zone 1, Computer Science | JCR Q1, Computer Science, Information Systems

IEEE Transactions on Multimedia Pub Date : 2024-12-24 DOI:10.1109/TMM.2024.3521728

Lili Huang;Yiming Cao;Pengcheng Jia;Chenglong Li;Jin Tang;Chuanfu Li

Volume 27, pp. 557-567 | Full text: https://ieeexplore.ieee.org/document/10814666/

Citations: 0

Abstract



Knowledge-Guided Cross-Modal Alignment and Progressive Fusion for Chest X-Ray Report Generation

Chest X-ray report generation, which aims to simulate a doctor's diagnostic process, has received widespread attention. Compared with image captioning, it is more challenging because it must produce a longer and more accurate description of each diagnostic region in the chest X-ray image. Most existing works focus on extracting better visual features or on producing more accurate text from existing reports. However, they ignore the interactions between the visual and text modalities and are therefore not in line with how humans reason. A few works do explore these interactions, but purely data-driven learning of the cross-modal mapping cannot bridge the semantic gap between the modalities. In this work, we propose a novel approach called Knowledge-guided Cross-modal Alignment and Progressive fusion (KCAP), which uses knowledge words from a constructed medical knowledge dictionary as a bridge to guide cross-modal feature alignment and fusion for accurate chest X-ray report generation. Specifically, we build the medical knowledge dictionary by extracting medical phrases from the training set and then selecting phrases with substantive meaning as knowledge words according to their frequency of occurrence. Guided by these knowledge words, the visual and text modalities interact through a mapping layer that enhances the features of both modalities, and an alignment fusion module is then introduced to mitigate the semantic gap between them. To retain important details of the original information, we design a progressive fusion scheme that integrates the advantages of both the salient fused features and the original features to generate better medical reports. Experimental results on the IU-Xray and MIMIC datasets demonstrate the effectiveness of the proposed KCAP.
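The dictionary-building step described in the abstract (extract phrases from the training reports, then keep substantive ones by frequency) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the n-gram phrase extractor, the `min_count` threshold, and the function names are all assumptions made for illustration, since the abstract does not specify how phrases are extracted or what counts as "substantive".

```python
import re
from collections import Counter

# Hypothetical stop-word list standing in for the "substantive meaning" filter;
# the abstract does not say how that filter is actually defined.
STOPWORDS = {"the", "is", "are", "a", "an", "of", "no", "and", "or",
             "in", "there", "with", "without", "to"}


def extract_phrases(report, max_len=2):
    """Return word n-grams (n = 1..max_len) that contain no stop word.

    Phrases are extracted sentence by sentence so that no n-gram
    spans a sentence boundary.
    """
    phrases = []
    for sentence in re.split(r"[.!?]", report.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    phrases.append(" ".join(gram))
    return phrases


def build_knowledge_dictionary(reports, min_count=2, max_len=2):
    """Map each phrase seen at least min_count times to an integer index.

    Indices are assigned by descending frequency, ties broken alphabetically.
    """
    counts = Counter()
    for report in reports:
        counts.update(extract_phrases(report, max_len))
    kept = sorted((p for p, c in counts.items() if c >= min_count),
                  key=lambda p: (-counts[p], p))
    return {phrase: idx for idx, phrase in enumerate(kept)}


# Toy training reports, invented for this example only.
reports = [
    "The heart is normal in size. No pleural effusion.",
    "There is a small pleural effusion. The lungs are clear.",
    "Lungs are clear. No pneumothorax or pleural effusion.",
]
knowledge_words = build_knowledge_dictionary(reports)
print(knowledge_words)
```

In a full pipeline, the resulting indices would presumably select embeddings for the knowledge words that guide alignment and fusion; how KCAP does this is described in the paper itself, not in this sketch.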


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors.