M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization

Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention Pub Date : 2023-07-17 DOI:10.48550/arXiv.2307.08347

Che Liu, Sibo Cheng, Chen Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci

{"title":"M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization","authors":"Che Liu, Sibo Cheng, Chen Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci","doi":"10.48550/arXiv.2307.08347","DOIUrl":null,"url":null,"abstract":"Medical vision-language models enable co-learning and integrating features from medical imaging and clinical text. However, these models are not easy to train and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches and reduces the number of parameters by 78\\%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1\\% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100\\% of the data.","PeriodicalId":18289,"journal":{"name":"Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention","volume":"16 1","pages":"637-647"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2307.08347","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

Abstract

Medical vision-language models enable co-learning and integrating features from medical imaging and clinical text. However, these models are not easy to train and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches and reduces the number of parameters by 78\%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100\% of the data.

查看原文本刊更多论文

M-FLAG:基于冻结语言模型和潜在空间几何优化的医学视觉语言预训练

医学视觉语言模型使共同学习和整合医学影像和临床文本的特征成为可能。然而，这些模型不容易训练，并且潜在表示空间可能很复杂。本文提出了一种新的医学视觉语言模型的预训练和正则化方法。本文提出的基于冻结语言模型和潜在空间几何优化(M-FLAG)的医学视觉语言预训练方法，利用冻结语言模型提高训练的稳定性和效率，并引入一种新的正交性损失来协调潜在空间几何。我们展示了预训练模型在三个下游任务上的潜力:医学图像分类、分割和目标检测。在五个公共数据集上进行的大量实验表明，M-FLAG显著优于现有的医学视觉语言预训练方法，并将参数数量减少了78%。值得注意的是，M-FLAG仅使用1%的RSNA数据集就在分割任务上取得了出色的性能，甚至超过了使用100%的数据进行微调的ImageNet预训练模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention

自引率

0.00%

发文量