MedCLIP-SAMv2: Towards universal text-driven medical image segmentation

IF 11.8; Zone 1 (Medicine); JCR Q1, Computer Science, Artificial Intelligence

Medical Image Analysis. Pub Date: 2025-08-05. DOI: 10.1016/j.media.2025.103749

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

Medical Image Analysis, volume 107, Article 103749. Publisher page: https://www.sciencedirect.com/science/article/pii/S1361841525002968

Citations: 0

Abstract

Segmentation of anatomical structures and pathologies in medical images is essential for modern disease diagnosis, clinical research, and treatment planning. While significant advancements have been made in deep learning-based segmentation techniques, many of these methods still suffer from limitations in data efficiency, generalizability, and interactivity. As a result, developing robust segmentation methods that require fewer labeled datasets remains a critical challenge in medical image analysis. Recently, the introduction of foundation models like CLIP and Segment-Anything-Model (SAM), with robust cross-domain representations, has paved the way for interactive and universal image segmentation. However, further exploration of these models for data-efficient segmentation in medical imaging is an active field of research. In this paper, we introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans using text prompts, in both zero-shot and weakly supervised settings. Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, and leveraging the Multi-modal Information Bottleneck (M2IB) to create visual prompts for generating segmentation masks with SAM in the zero-shot setting. We also investigate using zero-shot segmentation labels in a weakly supervised paradigm to enhance segmentation quality further. Extensive validation across four diverse segmentation tasks and medical imaging modalities (breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT) demonstrates the high accuracy of our proposed framework. Our code is available at https://github.com/HealthX-Lab/MedCLIP-SAMv2.
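The abstract names the DHN-NCE loss but does not define it. As a rough illustration only, the sketch below shows one way a contrastive loss can be both "decoupled" (the matched pair is left out of the denominator) and "hard-negative" weighted (negatives that look more similar to the anchor count more). The function name `dhn_nce_sketch`, the parameters `tau` and `beta`, and their default values are assumptions made for this example; they are not taken from the paper or its repository, which remain the authoritative source for the actual formulation.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def dhn_nce_sketch(img_emb, txt_emb, tau=0.07, beta=0.5):
    """Illustrative decoupled, hard-negative-weighted contrastive loss.

    img_emb, txt_emb: (n, d) arrays where row i of each forms a matched pair.
    tau:  temperature (hypothetical default).
    beta: hardness exponent; beta = 0 weights every negative equally
          (hypothetical default).
    """
    n = img_emb.shape[0]
    if n < 2:
        raise ValueError("need at least two pairs to have any negatives")

    # Cosine similarities scaled by the temperature; the diagonal holds positives.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    off_diag = ~np.eye(n, dtype=bool)

    def one_direction(lg):
        pos = np.diag(lg)
        # Decoupling: the positive is masked out of the denominator.
        neg = np.where(off_diag, lg, -np.inf)
        # Hardness weights: softmax of beta * logit over the negatives,
        # rescaled by (n - 1) so the weights average to 1 in each row.
        scaled = np.where(off_diag, beta * lg, -np.inf)
        log_w = scaled - _logsumexp(scaled, axis=1)[:, None] + np.log(n - 1)
        return np.mean(-pos + _logsumexp(neg + log_w, axis=1))

    # Symmetric: image-to-text plus text-to-image.
    return one_direction(logits) + one_direction(logits.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 32))
    texts = images + 0.1 * rng.normal(size=(8, 32))  # roughly aligned pairs
    print(dhn_nce_sketch(images, texts))
```

In this sketch, setting `beta=0` reduces the weighting to a plain decoupled contrastive loss, which makes the effect of the hardness term easy to isolate when experimenting.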


Source journal: Medical Image Analysis (Engineering and Technology; Engineering: Biomedical)

CiteScore: 22.10
Self-citation rate: 6.40%
Articles published: 309
Review time: 6.6 months

Journal description: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.