{"title":"Navigating the landscape of multimodal AI in medicine: A scoping review on technical challenges and clinical applications","authors":"Daan Schouten , Giulia Nicoletti , Bas Dille , Catherine Chia , Pierpaolo Vendittelli , Megan Schuurmans , Geert Litjens , Nadieh Khalili","doi":"10.1016/j.media.2025.103621","DOIUrl":"10.1016/j.media.2025.103621","url":null,"abstract":"<div><div>Recent technological advances in healthcare have led to unprecedented growth in patient data quantity and diversity. While artificial intelligence (AI) models have shown promising results in analyzing individual data modalities, there is increasing recognition that models integrating multiple complementary data sources, so-called multimodal AI, could enhance clinical decision-making. This scoping review examines the landscape of deep learning-based multimodal AI applications across the medical domain, analyzing 432 papers published between 2018 and 2024. We provide an extensive overview of multimodal AI development across different medical disciplines, examining various architectural approaches, fusion strategies, and common application areas. Our analysis reveals that multimodal AI models consistently outperform their unimodal counterparts, with an average improvement of 6.2 percentage points in AUC. However, several challenges persist, including cross-departmental coordination, heterogeneous data characteristics, and incomplete datasets. We critically assess the technical and practical challenges in developing multimodal AI systems and discuss potential strategies for their clinical implementation, including a brief overview of commercially available multimodal AI models for clinical decision-making. Additionally, we identify key factors driving multimodal AI development and propose recommendations to accelerate the field’s maturation. This review provides researchers and clinicians with a thorough understanding of the current state, challenges, and future directions of multimodal AI in medicine.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"105 ","pages":"Article 103621"},"PeriodicalIF":10.7,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144221852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning detection of acute and sub-acute lesion activity from single-timepoint conventional brain MRI in multiple sclerosis","authors":"Quentin Spinat , Benoit Audelan , Xiaotong Jiang , Bastien Caba , Alexis Benichoux , Despoina Ioannidou , Olivier Teboul , Nikos Komodakis , Willem Huijbers , Refaat Gabr , Arie Gafson , Colm Elliott , Douglas Arnold , Nikos Paragios , Shibeshih Belachew","doi":"10.1016/j.media.2025.103619","DOIUrl":"10.1016/j.media.2025.103619","url":null,"abstract":"<div><div>Multiple sclerosis (MS) is a chronic inflammatory disease characterized by demyelinating lesions in the central nervous system. Cross-sectional measurements of acute inflammatory lesion activity are typically obtained by detecting the presence of gadolinium enhancement in lesions, which typically lasts 3-6 weeks. We formulate the novel and clinically relevant task of quantification of recent acute lesion activity from the past 24 weeks (6 months) using single-timepoint conventional brain magnetic resonance imaging (MRI). We develop and compare several deep learning (DL) methods for estimating this brain-level acuteness score and show that a 2D-UNet can accurately predict acute disease activity at the patient-level while outperforming transformers and ensemble approaches. In the context of identifying subjects with acute (less than 6 months-old) lesion activity, our 2D-UNet achieves an area under the receiver-operating curve in the range <span><math><mrow><mn>80</mn><mo>−</mo><mn>84</mn><mtext>%</mtext></mrow></math></span> on independent relapsing-remitting MS cohorts. When used in conjunction with measurements of gadolinium-enhancing lesion activity, our model significantly improves the prognostication of future acute lesion activity (over the next 6 months). This model could thus be leveraged for population recruitment in clinical trials to identify a higher number of patients with acute inflammatory activity than current standard approaches (e.g., gadolinium positivity) with a predictable precision/recall trade-off.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"105 ","pages":"Article 103619"},"PeriodicalIF":10.7,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144254410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating medical dataset bias by learning adaptive agreement from a biased council","authors":"Luyang Luo , Xin Huang , Minghao Wang , Zhuoyue Wan , Wanteng Ma , Hao Chen","doi":"10.1016/j.media.2025.103629","DOIUrl":"10.1016/j.media.2025.103629","url":null,"abstract":"<div><div>Dataset bias in images is an important yet less explored topic in medical images. Deep learning could be prone to learning spurious correlation raised by dataset bias, resulting in inaccurate, unreliable, and unfair models, which impedes its adoption in real-world clinical applications. Despite its significance, there is a dearth of research in the medical image classification domain to address dataset bias. Furthermore, the bias labels are often agnostic, as identifying biases can be laborious and depend on post-hoc interpretation. This paper proposes learning Adaptive Agreement from a Biased Council (Ada-ABC), a debiasing framework that does not rely on explicit bias labels to tackle dataset bias in medical images. Ada-ABC develops a biased council consisting of multiple classifiers optimized with generalized cross entropy loss to learn the dataset bias. A debiasing model is then simultaneously trained under the guidance of the biased council. Specifically, the debiasing model is required to learn adaptive agreement with the biased council by agreeing on the correctly predicted samples and disagreeing on the wrongly predicted samples by the biased council. In this way, the debiasing model could learn the target attribute on the samples without spurious correlations while also avoiding ignoring the rich information in samples with spurious correlations. We theoretically demonstrated that the debiasing model could learn the target features when the biased model successfully captures dataset bias. Moreover, we constructed the first medical debiasing benchmark focusing on addressing spurious correlation from four datasets containing seven different bias scenarios. Our extensive experiments practically showed that our proposed Ada-ABC outperformed competitive approaches, verifying its effectiveness in mitigating dataset bias for medical image classification. The codes and organized benchmark datasets can be accessed via <span><span>https://github.com/LLYXC/Ada-ABC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"105 ","pages":"Article 103629"},"PeriodicalIF":10.7,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144212441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An orchestration learning framework for ultrasound imaging: Prompt-Guided Hyper-Perception and Attention-Matching Downstream Synchronization","authors":"Zehui Lin , Shuo Li , Shanshan Wang , Zhifan Gao , Yue Sun , Chan-Tong Lam , Xindi Hu , Xin Yang , Dong Ni , Tao Tan","doi":"10.1016/j.media.2025.103639","DOIUrl":"10.1016/j.media.2025.103639","url":null,"abstract":"<div><div>Ultrasound imaging is pivotal in clinical diagnostics due to its affordability, portability, safety, real-time capability, and non-invasive nature. It is widely utilized for examining various organs, such as the breast, thyroid, ovary, cardiac, and more. However, the manual interpretation and annotation of ultrasound images are time-consuming and prone to variability among physicians. While single-task artificial intelligence (AI) solutions have been explored, they are not ideal for scaling AI applications in medical imaging. Foundation models, although a trending solution, often struggle with real-world medical datasets due to factors such as noise, variability, and the incapability of flexibly aligning prior knowledge with task adaptation. To address these limitations, we propose an orchestration learning framework named PerceptGuide for general-purpose ultrasound classification and segmentation. Our framework incorporates a novel orchestration mechanism based on prompted hyper-perception, which adapts to the diverse inductive biases required by different ultrasound datasets. Unlike self-supervised pre-trained models, which require extensive fine-tuning, our approach leverages supervised pre-training to directly capture task-relevant features, providing a stronger foundation for multi-task and multi-organ ultrasound imaging. To support this research, we compiled a large-scale Multi-task, Multi-organ public ultrasound dataset (M<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>-US), featuring images from 9 organs and 16 datasets, encompassing both classification and segmentation tasks. Our approach employs four specific prompts—Object, Task, Input, and Position—to guide the model, ensuring task-specific adaptability. Additionally, a downstream synchronization training stage is introduced to fine-tune the model for new data, significantly improving generalization capabilities and enabling real-world applications. Experimental results demonstrate the robustness and versatility of our framework in handling multi-task and multi-organ ultrasound image processing, outperforming both specialist models and existing general AI solutions. Compared to specialist models, our method improves segmentation from 82.26% to 86.45%, classification from 71.30% to 79.08%, while also significantly reducing model parameters.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"104 ","pages":"Article 103639"},"PeriodicalIF":10.7,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144168328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ScanAhead: Simplifying standard plane acquisition of fetal head ultrasound","authors":"Qianhui Men , He Zhao , Lior Drukker , Aris T. Papageorghiou , J. Alison Noble","doi":"10.1016/j.media.2025.103614","DOIUrl":"10.1016/j.media.2025.103614","url":null,"abstract":"<div><div>The fetal standard plane acquisition task aims to detect an Ultrasound (US) image characterized by specified anatomical landmarks and appearance for assessing fetal growth. However, in practice, due to variability in human operator skill and possible fetal motion, it can be challenging for a human operator to acquire a satisfactory standard plane. To support a human operator with this task, this paper first describes an approach to automatically predict the fetal head standard plane from a video segment approaching the standard plane. A transformer-based image predictor is proposed to produce a high-quality standard plane by understanding diverse scales of head anatomy within the US video frame. Because of the visual gap between the video frames and standard plane image, the predictor is equipped with an offset adaptor that performs domain adaption to translate the off-plane structures to the anatomies that would usually appear in a standard plane view. To enhance the anatomical details of the predicted US image, the approach is extended by utilizing a second modality, US probe movement, that provides 3D location information. Quantitative and qualitative studies conducted on two different head biometry planes demonstrate that the proposed US image predictor produces clinically plausible standard planes with superior performance to comparative published methods. The results of dual-modality solution show an improved visualization with enhanced anatomical details of the predicted US image. Clinical evaluations are also conducted to demonstrate the consistency between the predicted echo textures and the expected echo patterns seen in a typical real standard plane, which indicates its clinical feasibility for improving the standard plane acquisition process.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"104 ","pages":"Article 103614"},"PeriodicalIF":10.7,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144168330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient few-shot medical image segmentation via self-supervised variational autoencoder","authors":"Yanjie Zhou , Feng Zhou , Fengjun Xi , Yong Liu , Yun Peng , David E. Carlson , Liyun Tu","doi":"10.1016/j.media.2025.103637","DOIUrl":"10.1016/j.media.2025.103637","url":null,"abstract":"<div><div>Few-shot medical image segmentation typically uses a joint model for registration and segmentation. The registration model aligns a labeled atlas with unlabeled images to form initial masks, which are then refined by the segmentation model. However, inevitable spatial misalignments during registration can lead to inaccuracies and diminished segmentation quality. To address this, we developed EFS-MedSeg, an end-to-end model using two labeled atlases and few unlabeled images, enhanced by data augmentation and self-supervised learning. Initially, EFS-MedSeg applies a 3D random regional switch strategy to augment atlases, thereby enhancing supervision in segmentation tasks. This not only introduces variability to the training data but also enhances the model’s ability to generalize and prevents overfitting, resulting in natural and smooth label boundaries. Following this, we use a variational autoencoder for a weighted reconstruction task, focusing the model’s attention on areas with lower Dice scores to ensure accurate segmentation that conforms to the atlas image’s shape and structural appearance. Moreover, we introduce a self-contrastive module aimed at improving feature extraction, guided by anatomical structure priors, thus enhancing the model’s convergence and segmentation accuracy. Results on multi-modal medical image datasets show that EFS-MedSeg achieves performance comparable to fully-supervised methods. Moreover, it consistently surpasses the second-best method in Dice score by 1.4%, 9.1%, and 1.1% on the OASIS, BCV, and BCH datasets, respectively, highlighting its robustness and adaptability across diverse datasets. The source code will be made publicly available at: <span><span>https://github.com/NoviceFodder/EFS-MedSeg</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"104 ","pages":"Article 103637"},"PeriodicalIF":10.7,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144168351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learnable prototype-guided multiple instance learning for detecting tertiary lymphoid structures in multi-cancer whole-slide pathological images","authors":"Pengfei Xia , Dehua Chen , Huimin An , Kiat Shenq Lim , Xiaoqun Yang","doi":"10.1016/j.media.2025.103652","DOIUrl":"10.1016/j.media.2025.103652","url":null,"abstract":"<div><div>Tertiary lymphoid structures (TLS) are ectopic lymphoid aggregates that form under specific pathological conditions, such as chronic inflammation and malignancies. Their presence within the tumor microenvironment (TME) is strongly correlated with patient prognosis and response to immunotherapy, making TLS detection in whole-slide pathological images (WSIs) crucial for clinical decision-making. Although multiple instance learning (MIL) has shown promise in tumor microenvironment studies, its potential for TLS detection has received limited attention. Additionally, the sparsity and heterogeneity of TLS in WSIs present significant challenges for feature extraction and limit the generalizability of MIL across different cancer types. To address this issue, this paper proposes a weakly supervised framework, Learnable Prototype-Guided Multiple Instance Learning (LPGMIL). From the perspective of the cellular composition of TLS, LPGMIL selects lymphocyte-dense instances to construct learnable global prototypes that are gradually adjusted during training to focus on TLS-related features. Additionally, LPGMIL computes each WSI using multiple learnable global prototypes, effectively capturing diverse TLS pathological patterns and refining the representation of complex TLS features. Unlike previous methods evaluated on single cancer-type datasets, we integrate a six-cancer-type TCGA dataset to better reflect the diversity and complexity of real-world clinical cases. Experimental results and visualizations show that LPGMIL outperforms other compared methods on a six-cancer-type TCGA dataset, achieving 76.6 % accuracy, 74.1 % recall, 82.7 % F1-score, and 83.5 % AUC, demonstrating its effectiveness in addressing the sparsity and heterogeneity of TLS. Code is available at: <span><span>https://github.com/FPXMU/LPGMIL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"104 ","pages":"Article 103652"},"PeriodicalIF":10.7,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144168329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HGMSurvNet: A two-stage hypergraph learning network for multimodal cancer survival prediction","authors":"Saisai Ding , Linjin Li , Ge Jin , Jun Wang , Shihui Ying , Jun Shi","doi":"10.1016/j.media.2025.103661","DOIUrl":"10.1016/j.media.2025.103661","url":null,"abstract":"<div><div>Cancer survival prediction based on multimodal data (e.g., pathological slides, clinical records, and genomic profiles) has become increasingly prevalent in recent years. A key challenge of this task is obtaining an effective survival-specific global representation from patient data with highly complicated correlations. Furthermore, the absence of certain modalities is a common issue in clinical practice, which renders current multimodal methods either outdated or ineffective. This article proposes a novel two-stage hypergraph learning network, called HGMSurvNet, for multimodal cancer survival prediction. HGMSurvNet can gradually learn the higher-order global representations from the WSI-level to the patient-level for multimodal learning via multilateral correlation modeling in multiple stages. Most importantly, to address the data noise and missing modalities issues in clinical scenarios, we develop a new hypergraph convolution network with a hyperedge dropout mechanism to discard unimportant hyperedges during model training. Extensive validation experiments were conducted on six public cancer cohorts from TCGA. The results demonstrated that the proposed method consistently outperforms state-of-the-art methods. We also demonstrate the interpretable analysis of HGMSurvNet and its application potential in pathological images and patient modeling, which has valuable clinical significance for the survival prognosis.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"104 ","pages":"Article 103661"},"PeriodicalIF":10.7,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}