PM²: A new prompting multi-modal model paradigm for few-shot medical image classification

IF 4.8 | Zone 2 (Medicine) | JCR Q1 | Computer Science, Interdisciplinary Applications

Computer Methods and Programs in Biomedicine | Pub Date: 2025-09-06 | DOI: 10.1016/j.cmpb.2025.109042

Zhenwei Wang, Qiule Sun, Bingbing Zhang, Pengfei Wang, Jianxin Zhang, Qiang Zhang

Volume 272, Article 109042 | Journal Article | https://www.sciencedirect.com/science/article/pii/S0169260725004596

Citations: 0

Abstract

Background and Objective:

Few-shot learning has emerged as a key technological solution to address challenges such as limited data and the difficulty of acquiring annotations in medical image classification. However, relying solely on a single image modality is insufficient to capture conceptual categories. Therefore, medical image classification requires a comprehensive approach to capture conceptual category information that aids in the interpretation of image content.

Methods:

This study proposes PM², a novel medical image classification paradigm based on a multi-modal foundation model. In addition to the image modality, PM² introduces supplementary text input (a prompt) that further describes images or conceptual categories and facilitates cross-modal few-shot learning. We empirically study five prompting schemes under this paradigm. Furthermore, linear probing in multi-modal models takes only the class token as input, ignoring the rich statistical information contained in high-level visual tokens. We therefore alternately perform linear classification on the feature distributions of the visual tokens and on the class token. To extract this statistical information effectively, we aggregate the visual tokens using global covariance pooling with efficient matrix power normalization. We then combine two classification heads: one handles the image class token and the prompt representations encoded by the text encoder, and the other classifies the feature distributions of the visual tokens.
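The covariance pooling step described above can be illustrated with a minimal NumPy sketch. The abstract does not state which power or algorithm is used for the matrix power normalization, so this sketch assumes a power of 1/2 computed with a coupled Newton-Schulz iteration, a common choice for efficient covariance pooling. The function names, iteration count, and `eps` value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def newton_schulz_sqrt(mat, num_iters=5):
    """Approximate the square root of a symmetric positive-definite matrix.

    Assumption: power 1/2 via coupled Newton-Schulz iteration; the paper's
    exact normalization may differ.
    """
    dim = mat.shape[0]
    identity = np.eye(dim)
    trace = np.trace(mat)
    y = mat / trace            # pre-normalize so the iteration converges
    z = identity.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t              # y approaches (mat / trace)^(1/2)
        z = t @ z              # z approaches (mat / trace)^(-1/2)
    return y * np.sqrt(trace)  # undo the pre-normalization


def covariance_pool(tokens, num_iters=5, eps=1e-5):
    """Aggregate visual tokens of shape (num_tokens, dim) into one vector."""
    num_tokens, dim = tokens.shape
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Second-order statistics of the tokens, regularized for stability.
    cov = centered.T @ centered / num_tokens + eps * np.eye(dim)
    normalized = newton_schulz_sqrt(cov, num_iters)
    # The matrix is symmetric, so the upper triangle holds all the information.
    rows, cols = np.triu_indices(dim)
    return normalized[rows, cols]
```

For tokens of dimension `dim`, the pooled vector has `dim * (dim + 1) / 2` entries, which is why such pipelines usually reduce the token dimension before pooling.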

Results:

Experimental results on three datasets (breast cancer, brain tumor, and diabetic retinopathy) demonstrate that PM² improves medical image classification performance. Compared with existing multi-modal models, PM² achieves state-of-the-art performance.

Conclusions:

Integrating text prompts as supplementary samples effectively enhances the model’s performance. Additionally, leveraging second-order features of the visual tokens to enrich the category feature space, and combining them with the class token, significantly strengthens the model’s representational capacity.
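As a rough illustration of the two ideas in these conclusions, the sketch below trains one linear head on image class-token features with text-prompt embeddings appended as extra labeled samples, trains a second head on second-order features, and fuses the two sets of logits. All features here are synthetic stand-ins for encoder outputs, and the weighted-sum fusion with `alpha`, the dimensions, and the helper names are assumptions made for illustration; the abstract does not specify how the heads are combined.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_linear_head(x, y, num_classes, lr=0.5, steps=300):
    """Softmax regression trained by full-batch gradient descent."""
    w = np.zeros((x.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        grad = (softmax(x @ w + b) - onehot) / len(x)
        w -= lr * x.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b


# Synthetic stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
num_classes, dim, cov_dim, shots = 3, 16, 10, 4
labels = np.repeat(np.arange(num_classes), shots)

protos = rng.normal(size=(num_classes, dim))
img_cls = l2_normalize(protos[labels] + 0.3 * rng.normal(size=(len(labels), dim)))
txt = l2_normalize(protos + 0.1 * rng.normal(size=protos.shape))
txt_labels = np.arange(num_classes)

cov_protos = rng.normal(size=(num_classes, cov_dim))
cov_feats = l2_normalize(
    cov_protos[labels] + 0.3 * rng.normal(size=(len(labels), cov_dim))
)

# Head 1: class tokens plus prompt embeddings treated as extra training samples.
x1 = np.concatenate([img_cls, txt])
y1 = np.concatenate([labels, txt_labels])
w1, b1 = train_linear_head(x1, y1, num_classes)

# Head 2: second-order (covariance-pooled) features of the visual tokens.
w2, b2 = train_linear_head(cov_feats, labels, num_classes)


def predict(cls_feat, cov_feat, alpha=0.5):
    """Fuse the two heads by a weighted sum of logits (assumed fusion rule)."""
    logits = alpha * (cls_feat @ w1 + b1) + (1.0 - alpha) * (cov_feat @ w2 + b2)
    return logits.argmax(axis=1)


train_pred = predict(img_cls, cov_feats)
print("training accuracy:", (train_pred == labels).mean())
```

Because the prompt embeddings share the feature space of the class tokens, appending them costs one extra row per class in the training matrix, which matters most when only a few image shots are available.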

Source journal

Computer Methods and Programs in Biomedicine (Engineering & Technology: Engineering, Biomedical)

CiteScore: 12.30

Self-citation rate: 6.60%

Articles published: 601

Review time: 135 days

About the journal: To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine. Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.