{"title":"Meta-analysis of AI-based pulmonary embolism detection: How reliable are deep learning models?","authors":"Ezio Lanza , Angela Ammirabile , Marco Francone","doi":"10.1016/j.compbiomed.2025.110402","DOIUrl":null,"url":null,"abstract":"<div><h3>Rationale and objectives</h3><div>Deep learning (DL)–based methods show promise in detecting pulmonary embolism (PE) on CT pulmonary angiography (CTPA), potentially improving diagnostic accuracy and workflow efficiency. This meta-analysis aimed to (1) determine pooled performance estimates of DL algorithms for PE detection; and (2) compare the diagnostic efficacy of convolutional neural network (CNN)– versus U-Net–based architectures.</div></div><div><h3>Materials and methods</h3><div>Following PRISMA guidelines, we searched PubMed and EMBASE through April 15, 2025 for English‐language studies (2010–2025) reporting DL models for PE detection with extractable 2 × 2 data or performance metrics. True/false positives and negatives were reconstructed when necessary under an assumed 50 % PE prevalence (with 0.5 continuity correction). We approximated AUROC as the mean of sensitivity and specificity if not directly reported. Sensitivity, specificity, accuracy, PPV and NPV were pooled using a DerSimonian–Laird random-effects model with Freeman-Tukey transformation; AUROC values were combined via a fixed-effect inverse-variance approach. Heterogeneity was assessed by Cochran's Q and I<sup>2</sup>. Subgroup analyses contrasted CNN versus U-Net models.</div></div><div><h3>Results</h3><div>Twenty-four studies (n = 22,984 patients) met inclusion criteria. Pooled estimates were: AUROC 0.895 (95 % CI: 0.874–0.917), sensitivity 0.894 (0.856–0.923), specificity 0.871 (0.831–0.903), accuracy 0.857 (0.833–0.882), PPV 0.832 (0.794–0.869) and NPV 0.902 (0.874–0.929). Between-study heterogeneity was high (I<sup>2</sup> ≈ 97 % for sensitivity/specificity). U-Net models exhibited higher sensitivity (0.899 vs 0.893) and CNN models higher specificity (0.926 vs 0.900); subgroup Q‐tests confirmed significant differences for both sensitivity (p = 0.0002) and specificity (p < 0.001).</div></div><div><h3>Conclusions</h3><div>DL algorithms demonstrate high diagnostic accuracy for PE detection on CTPA, with complementary strengths: U-Net architectures excel in true-positive identification, whereas CNNs yield fewer false positives. However, marked heterogeneity underscores the need for standardized, prospective validation before routine clinical implementation.</div></div>","PeriodicalId":10578,"journal":{"name":"Computers in biology and medicine","volume":"193 ","pages":"Article 110402"},"PeriodicalIF":7.0000,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers in biology and medicine","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S001048252500753X","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Rationale and objectives
Deep learning (DL)–based methods show promise in detecting pulmonary embolism (PE) on CT pulmonary angiography (CTPA), potentially improving diagnostic accuracy and workflow efficiency. This meta-analysis aimed to (1) determine pooled performance estimates of DL algorithms for PE detection; and (2) compare the diagnostic efficacy of convolutional neural network (CNN)– versus U-Net–based architectures.
Materials and methods
Following PRISMA guidelines, we searched PubMed and EMBASE through April 15, 2025 for English‐language studies (2010–2025) reporting DL models for PE detection with extractable 2 × 2 data or performance metrics. True/false positives and negatives were reconstructed when necessary under an assumed 50 % PE prevalence (with 0.5 continuity correction). We approximated AUROC as the mean of sensitivity and specificity if not directly reported. Sensitivity, specificity, accuracy, PPV and NPV were pooled using a DerSimonian–Laird random-effects model with Freeman-Tukey transformation; AUROC values were combined via a fixed-effect inverse-variance approach. Heterogeneity was assessed by Cochran's Q and I2. Subgroup analyses contrasted CNN versus U-Net models.
Results
Twenty-four studies (n = 22,984 patients) met inclusion criteria. Pooled estimates were: AUROC 0.895 (95 % CI: 0.874–0.917), sensitivity 0.894 (0.856–0.923), specificity 0.871 (0.831–0.903), accuracy 0.857 (0.833–0.882), PPV 0.832 (0.794–0.869) and NPV 0.902 (0.874–0.929). Between-study heterogeneity was high (I2 ≈ 97 % for sensitivity/specificity). U-Net models exhibited higher sensitivity (0.899 vs 0.893) and CNN models higher specificity (0.926 vs 0.900); subgroup Q‐tests confirmed significant differences for both sensitivity (p = 0.0002) and specificity (p < 0.001).
Conclusions
DL algorithms demonstrate high diagnostic accuracy for PE detection on CTPA, with complementary strengths: U-Net architectures excel in true-positive identification, whereas CNNs yield fewer false positives. However, marked heterogeneity underscores the need for standardized, prospective validation before routine clinical implementation.
期刊介绍:
Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.