Are Artificial Intelligence Models Reliable for Clinical Application in Pediatric Fracture Detection on Radiographs? A Systematic Review and Meta-analysis.
Gabriel Fontenele Ximenes,Átila Lobo Costa,Letícia Lima Leite,Lucas Lopes Costa,Matheus Oliveira Ribeiro,Paulo Giordano Baima Colares,Gilberto Santos Cerqueira
{"title":"Are Artificial Intelligence Models Reliable for Clinical Application in Pediatric Fracture Detection on Radiographs? A Systematic Review and Meta-analysis.","authors":"Gabriel Fontenele Ximenes,Átila Lobo Costa,Letícia Lima Leite,Lucas Lopes Costa,Matheus Oliveira Ribeiro,Paulo Giordano Baima Colares,Gilberto Santos Cerqueira","doi":"10.1097/corr.0000000000003660","DOIUrl":null,"url":null,"abstract":"BACKGROUND\r\nArtificial intelligence (AI) applications for pediatric fracture diagnosis using radiographs have demonstrated growing potential in clinical settings. Despite this growing potential, existing studies are limited by small sample sizes, variability in their diagnostic metrics, and inconsistent use of external validation, which reduces confidence in their findings. These limitations hinder the assessment of real-world performance. A meta-analysis would help address these gaps by pooling data to generate more robust, generalizable estimates for clinical application and future guidance.\r\n\r\nQUESTIONS/PURPOSES\r\n(1) What is the pooled diagnostic performance of AI models, including sensitivity, specificity, and area under the curve (AUC), for detecting pediatric fractures on radiographs? (2) What is the clinical applicability of AI models, as determined by whether their diagnostic performance is sustained in studies that employed external validation? (3) How does anatomic coverage influence the diagnostic performance of AI models?\r\n\r\nMETHODS\r\nThis meta-analysis adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines and was registered in PROSPERO (CRD42024628342). A systematic search of PubMed/MEDLINE, Embase, and the Cochrane Library was conducted from database inception through December 9, 2024. A total of 497 records were identified. Eligible studies included pediatric patients with suspected fractures evaluated by AI models on radiographs. 
Studies were excluded if they lacked sufficient data to calculate sensitivity, specificity, or AUC; if they combined adult and pediatric populations; or if they focused on rib fractures. Sixteen diagnostic accuracy studies were included, involving 10,203 pediatric patients with a mean age of 8.85 years, 54% of whom were male, and 21,789 radiographs, of which 5882 confirmed fractures. Data extraction followed the Population, Index test, Target condition (PIT) framework and was performed independently by two reviewers. The risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, which evaluates four domains (patient selection, index test, reference standard, and flow/timing) for low, high, or unclear risk. Most studies exhibited low to moderate risk of bias. Certainty of evidence was evaluated using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach, which classifies evidence as high, moderate, low, or very low, and in this study demonstrated high certainty of evidence. Heterogeneity in the pooled estimates was moderate for sensitivity (I2 = 61%) and high for specificity (I2 = 90%). No evidence of publication bias was detected based on Egger test (p = 0.54) and funnel plot symmetry. Meta-analyses used logit transformation and bivariate modeling to estimate pooled sensitivity, specificity, and AUC.\r\n\r\nRESULTS\r\nThe pooled analysis demonstrated a sensitivity of 93% (95% confidence interval [CI] 92% to 94%), a specificity of 91% (95% CI 88% to 93%), and an AUC of 0.96 (95% CI 0.92 to 0.97). The AUC reflects the overall ability of a model to distinguish between patients with and without fractures, with values closer to 1.0 indicating better diagnostic performance. 
When evaluated on external data sets, AI models maintained high diagnostic accuracy, with a sensitivity of 93% (95% CI 90% to 95%), specificity of 88% (95% CI 84% to 91%), and an AUC of 0.95 (95% CI 0.89 to 0.97), supporting their potential for clinical applicability. Anatomic coverage by specific region made a meaningful contribution to explaining the observed heterogeneity. Models evaluating multiple regions showed slightly higher sensitivity, while those focused on single regions demonstrated better specificity, suggesting that a broader anatomic scope may improve fracture detection but slightly reduce accuracy in ruling out false positives.\r\n\r\nCONCLUSION\r\nThis meta-analysis demonstrates that AI models can accurately detect pediatric fractures on radiographs, a finding that withstood scrutiny in studies that included external validation. These findings suggest that orthopaedic surgeons and emergency physicians can consider incorporating validated convolutional neural network algorithms into workflows to enhance diagnostic accuracy, especially in acute care settings where rapid and accurate decision-making is critical. Nevertheless, future research is needed to investigate performance across specific subgroups, including sex and anatomic regions. Paired-design diagnostic accuracy studies with external geographic validation remain the most appropriate method to assess their real-world value. 
Such validation should be prioritized as a prerequisite for clinical generalization and democratization of AI models, even before randomized trials or prospective implementation studies.\r\n\r\nLEVEL OF EVIDENCE\r\nLevel III, diagnostic study.","PeriodicalId":10404,"journal":{"name":"Clinical Orthopaedics and Related Research®","volume":"29 1","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Orthopaedics and Related Research®","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/corr.0000000000003660","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0
Abstract
BACKGROUND
Artificial intelligence (AI) applications for pediatric fracture diagnosis using radiographs have demonstrated growing potential in clinical settings. Despite this growing potential, existing studies are limited by small sample sizes, variability in their diagnostic metrics, and inconsistent use of external validation, which reduces confidence in their findings. These limitations hinder the assessment of real-world performance. A meta-analysis would help address these gaps by pooling data to generate more robust, generalizable estimates for clinical application and future guidance.
QUESTIONS/PURPOSES
(1) What is the pooled diagnostic performance of AI models, including sensitivity, specificity, and area under the curve (AUC), for detecting pediatric fractures on radiographs? (2) What is the clinical applicability of AI models, as determined by whether their diagnostic performance is sustained in studies that employed external validation? (3) How does anatomic coverage influence the diagnostic performance of AI models?
METHODS
This meta-analysis adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines and was registered in PROSPERO (CRD42024628342). A systematic search of PubMed/MEDLINE, Embase, and the Cochrane Library was conducted from database inception through December 9, 2024, identifying 497 records. Eligible studies included pediatric patients with suspected fractures evaluated by AI models on radiographs. Studies were excluded if they lacked sufficient data to calculate sensitivity, specificity, or AUC; if they combined adult and pediatric populations; or if they focused on rib fractures. Sixteen diagnostic accuracy studies were included, involving 10,203 pediatric patients (mean age 8.85 years; 54% male) and 21,789 radiographs, of which 5882 showed confirmed fractures. Data extraction followed the Population, Index test, Target condition (PIT) framework and was performed independently by two reviewers. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, which rates four domains (patient selection, index test, reference standard, and flow/timing) as low, high, or unclear risk; most studies exhibited low to moderate risk of bias. Certainty of evidence was evaluated using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach, which classifies evidence as high, moderate, low, or very low; in this study, the certainty of evidence was high. Heterogeneity in the pooled estimates was moderate for sensitivity (I² = 61%) and high for specificity (I² = 90%). No evidence of publication bias was detected based on the Egger test (p = 0.54) and funnel plot symmetry. Meta-analyses used logit transformation and bivariate modeling to estimate pooled sensitivity, specificity, and AUC.
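The logit transformation and heterogeneity statistics described above can be sketched in code. The following is an illustrative DerSimonian-Laird random-effects pooling of logit-transformed sensitivities with made-up study data; it is not the bivariate model the authors fit (bivariate models pool sensitivity and specificity jointly and require dedicated statistical software), but it shows how a pooled estimate and I² arise from study-level proportions.

```python
import math

def pool_logit(proportions, ns):
    """Pool study-level proportions (e.g., sensitivities) on the logit
    scale with a DerSimonian-Laird random-effects model; return the
    back-transformed pooled estimate and the I^2 heterogeneity statistic."""
    # Logit-transform each proportion; the approximate within-study
    # variance of a logit proportion is 1/(n*p) + 1/(n*(1-p)).
    thetas = [math.log(p / (1 - p)) for p in proportions]
    variances = [1 / (n * p) + 1 / (n * (1 - p)) for p, n in zip(proportions, ns)]

    # Fixed-effect (inverse-variance) weights and Cochran's Q.
    w = [1 / v for v in variances]
    theta_fe = sum(wi * t for wi, t in zip(w, thetas)) / sum(w)
    q = sum(wi * (t - theta_fe) ** 2 for wi, t in zip(w, thetas))
    df = len(thetas) - 1

    # Between-study variance (tau^2) and I^2 = excess heterogeneity / Q.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # Random-effects weights fold tau^2 into each study's variance.
    w_re = [1 / (v + tau2) for v in variances]
    theta_re = sum(wi * t for wi, t in zip(w_re, thetas)) / sum(w_re)
    pooled = 1 / (1 + math.exp(-theta_re))  # back-transform to a proportion
    return pooled, i2

# Hypothetical study-level sensitivities and sample sizes (not study data).
sens = [0.95, 0.91, 0.93, 0.89]
ns = [200, 150, 300, 120]
pooled, i2 = pool_logit(sens, ns)
```

Working on the logit scale keeps the pooled estimate inside (0, 1) and stabilizes variances for proportions near 1, which is why it is standard for diagnostic accuracy meta-analysis.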
RESULTS
The pooled analysis demonstrated a sensitivity of 93% (95% confidence interval [CI] 92% to 94%), a specificity of 91% (95% CI 88% to 93%), and an AUC of 0.96 (95% CI 0.92 to 0.97). The AUC reflects the overall ability of a model to distinguish between patients with and without fractures, with values closer to 1.0 indicating better diagnostic performance. When evaluated on external data sets, AI models maintained high diagnostic accuracy, with a sensitivity of 93% (95% CI 90% to 95%), specificity of 88% (95% CI 84% to 91%), and an AUC of 0.95 (95% CI 0.89 to 0.97), supporting their potential for clinical applicability. Anatomic coverage contributed meaningfully to explaining the observed heterogeneity. Models evaluating multiple regions showed slightly higher sensitivity, while those focused on single regions demonstrated better specificity, suggesting that a broader anatomic scope may improve fracture detection but slightly reduce specificity, yielding more false positives.
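As a concrete illustration of how the reported metrics are defined, the following minimal sketch computes sensitivity and specificity from a 2×2 diagnostic table and AUC from its rank-based (Mann-Whitney) interpretation. The counts and scores are made up for illustration, chosen only to reproduce the pooled point estimates; they are not study data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 diagnostic table."""
    sensitivity = tp / (tp + fn)  # share of true fractures the model flags
    specificity = tn / (tn + fp)  # share of non-fractures correctly cleared
    return sensitivity, specificity

def auc_mann_whitney(scores_fracture, scores_no_fracture):
    """AUC as the probability that a randomly chosen fractured case gets
    a higher model score than a randomly chosen non-fractured case;
    ties count as half a win."""
    wins = 0.0
    for f in scores_fracture:
        for n in scores_no_fracture:
            if f > n:
                wins += 1.0
            elif f == n:
                wins += 0.5
    return wins / (len(scores_fracture) * len(scores_no_fracture))

# Illustrative counts chosen to match the pooled estimates (93% / 91%).
sens, spec = diagnostic_metrics(tp=93, fp=9, fn=7, tn=91)
```

The Mann-Whitney form makes the abstract's gloss precise: an AUC of 0.96 means that in roughly 96 of 100 random fractured/non-fractured pairs, the model scores the fractured case higher.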
CONCLUSION
This meta-analysis demonstrates that AI models can accurately detect pediatric fractures on radiographs, a finding that withstood scrutiny in studies that included external validation. These findings suggest that orthopaedic surgeons and emergency physicians can consider incorporating validated convolutional neural network algorithms into workflows to enhance diagnostic accuracy, especially in acute care settings where rapid and accurate decision-making is critical. Nevertheless, future research is needed to investigate performance across specific subgroups, including sex and anatomic regions. Paired-design diagnostic accuracy studies with external geographic validation remain the most appropriate method to assess their real-world value. Such validation should be prioritized as a prerequisite for clinical generalization and democratization of AI models, even before randomized trials or prospective implementation studies.
LEVEL OF EVIDENCE
Level III, diagnostic study.
Journal description:
Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge.
CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.