Muhammad Umair Ali , Amad Zafar , Seonghan Kim , Kwang Su Kim , Seung Won Lee
{"title":"From task-specific to foundation models: A paradigm shift in medical vision-language analysis","authors":"Muhammad Umair Ali , Amad Zafar , Seonghan Kim , Kwang Su Kim , Seung Won Lee","doi":"10.1016/j.cosrev.2025.100831","DOIUrl":null,"url":null,"abstract":"<div><div>Integrating vision-language models (VLMs) into medical imaging drives a paradigm shift from task-specific systems toward generalist foundation models (FMs) capable of zero-shot and few-shot reasoning across diverse clinical domains. This review presents a comprehensive model-centric taxonomy, categorizing over 135 studies into three key developmental stages: (1) task-specific VLMs, (2) modular/adapter-based/prompt-tuned VLMs, and (3) foundation models. We systematically assess each category regarding architectural innovations, learning paradigms, clinical applications, and evaluation metrics. Our analysis reveals that the recent advances in multimodal contrastive learning, prompt engineering, and scalable transformer-based architectures significantly enhance generalizability, data efficiency, and multimodal interpretability in medical AI. Furthermore, we synthesize bibliometric trends and delineate methodological transitions through a PRISMA-based systematic review. This review article concludes with a discussion on the challenges and provides a roadmap for developing clinically reliable, data-efficient, and versatile VLMs, highlighting their transformative potential for improving diagnostic accuracy, workflow automation, and decision support in healthcare.</div></div>","PeriodicalId":48633,"journal":{"name":"Computer Science Review","volume":"59 ","pages":"Article 100831"},"PeriodicalIF":12.7000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Science Review","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1574013725001078","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Integrating vision-language models (VLMs) into medical imaging drives a paradigm shift from task-specific systems toward generalist foundation models (FMs) capable of zero-shot and few-shot reasoning across diverse clinical domains. This review presents a comprehensive model-centric taxonomy, categorizing over 135 studies into three key developmental stages: (1) task-specific VLMs, (2) modular/adapter-based/prompt-tuned VLMs, and (3) foundation models. We systematically assess each category regarding architectural innovations, learning paradigms, clinical applications, and evaluation metrics. Our analysis reveals that the recent advances in multimodal contrastive learning, prompt engineering, and scalable transformer-based architectures significantly enhance generalizability, data efficiency, and multimodal interpretability in medical AI. Furthermore, we synthesize bibliometric trends and delineate methodological transitions through a PRISMA-based systematic review. This review article concludes with a discussion on the challenges and provides a roadmap for developing clinically reliable, data-efficient, and versatile VLMs, highlighting their transformative potential for improving diagnostic accuracy, workflow automation, and decision support in healthcare.
期刊介绍:
Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known Computer Science methods to other areas are in scope only if these articles advance the fundamental understanding of those methods.