Enhancing Medical Vision-Language Contrastive Learning via Inter-Matching Relation Modeling

IEEE Transactions on Medical Imaging. Pub Date: 2025-01-29. DOI: 10.1109/TMI.2025.3534436

Mingjian Li, Mingyuan Meng, Michael Fulham, David Dagan Feng, Lei Bi, Jinman Kim

Volume 44, Issue 6, Pages 2463-2476. Publication type: Journal Article. URL: https://ieeexplore.ieee.org/document/10858000/

Abstract

Medical image representations can be learned through medical vision-language contrastive learning (mVLCL), where medical imaging reports are used as weak supervision through image-text alignment. These learned image representations can be transferred to, and benefit, various downstream medical vision tasks such as disease classification and segmentation. Recent mVLCL methods attempt to align image sub-regions and report keywords as local-matchings. However, these methods aggregate all local-matchings via simple pooling operations while ignoring the inherent relations between them. These methods therefore fail to reason between local-matchings that are semantically related, e.g., local-matchings that correspond to the disease word and the location word (semantic-relations), and also fail to differentiate such clinically important local-matchings from others that correspond to less meaningful words, e.g., conjunction words (importance-relations). Hence, we propose an mVLCL method that models the inter-matching relations between local-matchings via a relation-enhanced contrastive learning framework (RECLF). In RECLF, we introduce a semantic-relation reasoning module (SRM) and an importance-relation reasoning module (IRM) to enable more fine-grained report supervision for image representation learning. We evaluated our method using six public benchmark datasets on four downstream tasks: segmentation, zero-shot classification, linear classification, and cross-modal retrieval. Our results demonstrated the superiority of our RECLF over the state-of-the-art mVLCL methods, with consistent improvements across single-modal and cross-modal tasks. These results suggest that our RECLF, by modeling the inter-matching relations, can learn improved medical image representations with better generalization capabilities.
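The contrast the abstract draws between simple pooling and importance-aware aggregation of local-matchings can be illustrated with a small sketch. This is not the RECLF implementation: the abstract does not specify how SRM or IRM are computed, so the softmax weighting below is a stand-in technique, and every name in it (`mean_pool`, `weighted_pool`, `info_nce`, the example scores and importance logits) is a hypothetical chosen for illustration. The `info_nce` function shows a generic symmetric image-text contrastive loss of the kind commonly used for image-report alignment, again as an assumption rather than the paper's exact objective.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(local_scores):
    """Simple pooling: every local-matching contributes equally."""
    return sum(local_scores) / len(local_scores)


def weighted_pool(local_scores, importance_logits):
    """Importance-weighted aggregation (illustrative stand-in, not the IRM).

    importance_logits are hypothetical per-word scores; a higher logit
    gives that word's local-matching more influence on the aggregate.
    """
    weights = softmax(importance_logits)
    return sum(w * s for w, s in zip(weights, local_scores))


def info_nce(sim_matrix, temperature=0.1):
    """Generic symmetric image-text contrastive loss.

    sim_matrix[i][j] is the similarity of image i and report j;
    matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim_matrix)

    def one_direction(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            probs = softmax([s / temperature for s in row])
            loss -= math.log(probs[i])
        return loss / n

    columns = [list(col) for col in zip(*sim_matrix)]
    return 0.5 * (one_direction(sim_matrix) + one_direction(columns))


# Hypothetical report fragment: "pneumonia", "and", "left", "the".
scores = [0.9, 0.1, 0.8, 0.1]      # made-up region-word matching scores
logits = [2.0, -2.0, 2.0, -2.0]    # made-up importance logits

print("mean pooling:    ", mean_pool(scores))
print("weighted pooling:", weighted_pool(scores, logits))
```

With these made-up numbers, mean pooling lets the two low-scoring function words dilute the aggregate, while the weighted variant is dominated by the disease and location words, which is the behaviour the abstract attributes to modeling importance-relations.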

