Bogdan Obreja, Joeran Bosma, Kiran Vaidhya Venkadesh, Zaigham Saghir, Mathias Prokop, Colin Jacobs
Characterizing the Impact of Training Data on Generalizability: Application in Deep Learning to Estimate Lung Nodule Malignancy Risk
Radiology: Artificial Intelligence, e240636. Published 2025-08-20. DOI: https://doi.org/10.1148/ryai.240636
Abstract
Purpose: To investigate the relationship between training data volume and the performance of a deep learning AI algorithm developed to assess the malignancy risk of pulmonary nodules detected on low-dose CT scans in lung cancer screening.

Materials and Methods: This retrospective study used a dataset of 16077 annotated nodules (1249 malignant, 14828 benign) from the National Lung Screening Trial (NLST) to systematically train an AI algorithm for pulmonary nodule malignancy risk prediction on stratified subsets ranging from 1.25% of the data to the full dataset. External testing was conducted using data from the Danish Lung Cancer Screening Trial (DLCST) to determine the amount of training data at which the performance of the AI was statistically non-inferior to that of the AI trained on the full NLST cohort. A size-matched, cancer-enriched subset of DLCST, in which each malignant nodule was paired with the two benign nodules closest to it in diameter, was used to determine the amount of training data at which the performance of the AI algorithm was statistically non-inferior to the average performance of 11 clinicians.

Results: The external testing set included 599 participants (mean age 57.65 years (SD 4.84) for females and 59.03 years (SD 4.94) for males) with 883 nodules (65 malignant, 818 benign). The AI achieved a mean AUC of 0.92 [95% CI: 0.88, 0.96] on the DLCST cohort when trained on the full NLST dataset. Training with 80% of the NLST data resulted in non-inferior performance (mean AUC 0.92 [95% CI: 0.89, 0.96], P = .005). On the size-matched DLCST subset (59 malignant, 118 benign), the AI reached performance non-inferior to that of the clinicians (mean AUC 0.82 [95% CI: 0.77, 0.86]) with 20% of the training data (P = .02).

Conclusion: The deep learning AI algorithm demonstrated excellent performance in assessing pulmonary nodule malignancy risk, achieving clinician-level performance with a fraction of the training data and reaching peak performance before utilizing the full dataset.
©RSNA, 2025.
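The size-matching step described in Materials and Methods (each malignant nodule paired with the two benign nodules closest in diameter) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `size_match`, the greedy processing order, matching without replacement, the tie-breaking rule, and the example diameters are all assumptions, since the abstract does not specify these details.

```python
from typing import List, Tuple


def size_match(
    malignant: List[float], benign: List[float], k: int = 2
) -> List[Tuple[int, List[int]]]:
    """Pair each malignant nodule with the k benign nodules closest in diameter.

    Assumptions (not stated in the abstract): matching is greedy, in input
    order, and without replacement, so a benign nodule is used at most once.
    """
    available = set(range(len(benign)))
    pairs = []
    for m_idx, m_diam in enumerate(malignant):
        # Rank the remaining benign nodules by absolute diameter difference;
        # ties are broken by index to keep the result deterministic.
        closest = sorted(available, key=lambda b: (abs(benign[b] - m_diam), b))[:k]
        available.difference_update(closest)
        pairs.append((m_idx, closest))
    return pairs


# Hypothetical diameters in millimetres, for illustration only.
malignant_mm = [8.0, 15.0]
benign_mm = [4.0, 7.5, 9.0, 14.0, 16.5, 30.0]
matched = size_match(malignant_mm, benign_mm)
print(matched)  # [(0, [1, 2]), (1, [3, 4])]
```

With two benign matches per malignant nodule, a subset built this way has a 1:2 malignant-to-benign ratio, which is consistent with the 59 malignant and 118 benign nodules reported for the size-matched DLCST subset.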
"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered that could affect the content.