Zhenwei Wang, Qiule Sun, Bingbing Zhang, Pengfei Wang, Jianxin Zhang, Qiang Zhang
PM²: A new prompting multi-modal model paradigm for few-shot medical image classification
Computer Methods and Programs in Biomedicine, vol. 272, Article 109042. Published 2025-09-06. DOI: 10.1016/j.cmpb.2025.109042
https://www.sciencedirect.com/science/article/pii/S0169260725004596
Impact factor: 4.8 (JCR Q1, Computer Science, Interdisciplinary Applications)
Citations: 0
Abstract
Background and Objective:
Few-shot learning has emerged as a key technological solution to address challenges such as limited data and the difficulty of acquiring annotations in medical image classification. However, relying solely on a single image modality is insufficient to capture conceptual categories. Therefore, medical image classification requires a comprehensive approach to capture conceptual category information that aids in the interpretation of image content.
Methods:
This study proposes PM², a novel medical image classification paradigm based on a multi-modal foundation model. In addition to the image modality, PM² introduces supplementary text input (prompts) to further describe images or conceptual categories and to facilitate cross-modal few-shot learning. We empirically study five prompting schemes under this paradigm. Furthermore, linear probing in multi-modal models takes only the class token as input, ignoring the rich statistical information contained in high-level visual tokens. We therefore alternately perform linear classification on the feature distributions of the visual tokens and on the class token. To extract this statistical information effectively, we aggregate the visual tokens using global covariance pooling with efficient matrix power normalization. We then combine two classification heads: one handles the image class token together with the prompt representations produced by the text encoder, and the other classifies the feature distributions of the visual tokens.
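The covariance pooling step described above can be illustrated with a minimal NumPy sketch. The abstract does not name the specific normalization algorithm, so this example assumes a power of 1/2 computed with a coupled Newton-Schulz iteration, which is one common way to make matrix power normalization efficient. The function names, the `eps` regularizer, and the iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def covariance_pool(tokens: np.ndarray) -> np.ndarray:
    """Aggregate visual tokens of shape (n, d) into a (d, d) covariance matrix."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    return centered.T @ centered / tokens.shape[0]


def newton_schulz_sqrt(cov: np.ndarray, num_iters: int = 5, eps: float = 1e-6) -> np.ndarray:
    """Approximate the matrix square root of a symmetric PSD matrix.

    Assumption: power 1/2 via coupled Newton-Schulz iteration; the source
    abstract only says "efficient matrix power normalization".
    """
    d = cov.shape[0]
    identity = np.eye(d)
    cov = cov + eps * identity           # keep the matrix strictly positive definite
    trace = np.trace(cov)
    y = cov / trace                      # pre-normalize so the iteration converges
    z = identity.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t                        # y converges to sqrt(cov / trace)
        z = t @ z                        # z converges to the inverse square root
    return y * np.sqrt(trace)            # undo the trace pre-normalization


def second_order_feature(tokens: np.ndarray, num_iters: int = 5) -> np.ndarray:
    """Vectorize the upper triangle of the normalized covariance matrix."""
    root = newton_schulz_sqrt(covariance_pool(tokens), num_iters)
    rows, cols = np.triu_indices(root.shape[0])
    return root[rows, cols]
```

For `d`-dimensional tokens the resulting vector has `d * (d + 1) / 2` entries, which a linear classification head can consume directly. A small fixed number of iterations gives only an approximate square root; more iterations tighten the approximation at higher cost.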
Results:
Experimental results on three datasets (breast cancer, brain tumor, and diabetic retinopathy) demonstrate that PM² improves medical image classification performance. Compared with existing multi-modal models, PM² achieves state-of-the-art performance.
Conclusions:
Integrating text prompts as supplementary samples effectively enhances the model’s performance. Additionally, leveraging second-order features of the visual tokens to enrich the category feature space, and combining them with the class token, significantly strengthens the model’s representational capacity.
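The idea of treating text prompts as supplementary samples can be sketched as follows. This is a hedged illustration only: the abstract does not give training details, so the example assumes that image class-token features and prompt embeddings share one embedding space, L2-normalizes both, and trains a single softmax linear probe on their union. All function names and hyperparameters are hypothetical, and in practice the features would come from the multi-modal model's image and text encoders.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so both modalities are comparable."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def build_cross_modal_set(image_feats, image_labels, text_feats, text_labels):
    """Append prompt embeddings to the few-shot image features as extra samples."""
    feats = np.concatenate([l2_normalize(image_feats), l2_normalize(text_feats)], axis=0)
    labels = np.concatenate([image_labels, text_labels])
    return feats, labels


def train_linear_probe(feats, labels, num_classes, lr=1.0, steps=200):
    """Fit a softmax linear classifier with plain full-batch gradient descent."""
    n, d = feats.shape
    weights = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ weights
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        weights -= lr * feats.T @ (probs - onehot) / n       # cross-entropy gradient
    return weights


def predict(feats, weights):
    """Return the highest-scoring class index for each feature row."""
    return (l2_normalize(feats) @ weights).argmax(axis=1)
```

With `k` shots per class and one prompt per class, the probe sees `k + 1` samples per class instead of `k`, which is the sense in which the prompts act as additional training data in this sketch.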
About the journal:
To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; to promote the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine.
Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.