Cross-Modal Causal Representation Learning for Radiology Report Generation

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2025-03-16. DOI: 10.1109/TIP.2025.3568746

Weixing Chen;Yang Liu;Ce Wang;Jiarui Zhu;Guanbin Li;Cheng-Lin Liu;Liang Lin

Volume 34, pages 2970-2985. Journal Article. https://ieeexplore.ieee.org/document/11005686/


Abstract

Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance: by automatically generating a radiology report from a given radiology image, it can relieve the heavy burden on radiologists. However, generating accurate lesion descriptions remains challenging because of spurious correlations arising from visual-linguistic biases and because of inherent limitations of radiological imaging, such as low resolution and noise interference. To address these issues, we propose a two-stage framework named Cross-Modal Causal Representation Learning (CMCRL), consisting of Radiological Cross-modal Alignment and Reconstruction Enhanced (RadCARE) pre-training and Visual-Linguistic Causal Intervention (VLCI) fine-tuning. In the pre-training stage, RadCARE introduces a degradation-aware masked image restoration strategy tailored to radiological images, which reconstructs high-resolution patches from low-resolution inputs to mitigate noise and detail loss. Combined with a multiway architecture and four adaptive training strategies (e.g., text postfix generation from degraded images and text prefixes), RadCARE establishes robust cross-modal correlations even with incomplete data. In the VLCI stage, we apply causal front-door intervention through two modules: the Visual Deconfounding Module (VDM) disentangles local and global features without fine-grained annotations, while the Linguistic Deconfounding Module (LDM) eliminates context bias without external terminology databases. Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods, with ablation studies confirming the necessity of both stages. Code and models are available at https://github.com/WissingChen/CMCRL.
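The abstract names causal front-door intervention but does not say how VDM and LDM estimate it. As background only, the textbook front-door adjustment for a treatment X, mediator M, and outcome Y is P(Y | do(X=x)) = Σ_m P(m | x) Σ_x' P(Y | x', m) P(x'). The following is a minimal sketch of that formula on small discrete distributions. The function name `front_door`, the variable names, and all probability values are assumptions made up for illustration; they are not taken from the paper or its code.

```python
def front_door(p_x, p_m_given_x, p_y_given_xm, x, y):
    """Textbook front-door adjustment on discrete variables.

    Computes P(Y=y | do(X=x)) = sum_m P(m|x) * sum_x' P(y|x',m) * P(x').
    All inputs are plain dictionaries of probabilities.
    """
    total = 0.0
    for m, p_m in p_m_given_x[x].items():
        # Average the outcome over the observational distribution of X,
        # which blocks the back-door path through the unobserved confounder.
        inner = sum(p_y_given_xm[(xp, m)][y] * p_x[xp] for xp in p_x)
        total += p_m * inner
    return total


# Toy distributions (hypothetical values, chosen only for illustration).
p_x = {0: 0.5, 1: 0.5}
p_m_given_x = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.2, 1: 0.8},
}
p_y_given_xm = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.4, 1: 0.6},
    (1, 0): {0: 0.7, 1: 0.3},
    (1, 1): {0: 0.2, 1: 0.8},
}

effect_0 = front_door(p_x, p_m_given_x, p_y_given_xm, x=0, y=1)
effect_1 = front_door(p_x, p_m_given_x, p_y_given_xm, x=1, y=1)
print(f"P(Y=1 | do(X=0)) = {effect_0:.2f}")
print(f"P(Y=1 | do(X=1)) = {effect_1:.2f}")
```

In a report-generation setting, X would loosely correspond to the input features, M to a mediator derived from them, and Y to the generated text; how the paper's modules realize these roles is not specified in the abstract.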



