MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun
DOI: 10.18653/v1/2022.emnlp-main.256
Published in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), volume 2022, pages 3876-3887
Publication date: 2022-12-01
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11323634/pdf/

Abstract

Existing vision-text contrastive learning methods such as CLIP (Radford et al., 2021) match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general image-caption corpora scraped from the internet. Moreover, previous methods encounter many false negatives: images and reports from different patients may carry the same semantics yet are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, scaling the usable training data combinatorially at low cost. We also propose replacing the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, with only 20K pre-training samples, MedCLIP outperforms the state-of-the-art method trained on ≈200K samples.
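To illustrate the core idea in the abstract, the following is a minimal NumPy sketch contrasting the standard InfoNCE objective (hard one-hot targets on the diagonal of the image-text similarity matrix) with a soft semantic matching loss, where targets come from the overlap of per-sample medical-entity label vectors. The label vectors and function names here are hypothetical stand-ins for illustration, not MedCLIP's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def infonce_loss(logits):
    # CLIP-style InfoNCE: only the diagonal (paired) entries count as
    # positives; every off-diagonal pair is a negative, even if two
    # patients' reports describe the same finding (a false negative).
    n = logits.shape[0]
    targets = np.eye(n)
    log_probs = np.log(softmax(logits, axis=1))
    return -np.mean(np.sum(targets * log_probs, axis=1))

def semantic_matching_loss(logits, img_labels, txt_labels):
    # Soft targets from cosine similarity of (hypothetical) multi-hot
    # medical-entity label vectors: pairs that share semantics get
    # nonzero target mass instead of being forced apart.
    sim = img_labels @ txt_labels.T
    norm = (np.linalg.norm(img_labels, axis=1, keepdims=True)
            * np.linalg.norm(txt_labels, axis=1, keepdims=True).T)
    cos = np.divide(sim, norm, out=np.zeros_like(sim), where=norm > 0)
    soft_targets = softmax(cos, axis=1)
    log_probs = np.log(softmax(logits, axis=1))
    return -np.mean(np.sum(soft_targets * log_probs, axis=1))
```

Because the soft targets spread probability mass across semantically similar pairs, the gradient no longer penalizes matches between different patients with the same condition, which is the false-negative problem the abstract describes.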
