Impact of retraining and data partitions on the generalizability of a deep learning model in the task of COVID-19 classification on chest radiographs
Mena Shenouda, Heather M Whitney, Maryellen L Giger, Samuel G Armato
Journal of Medical Imaging, vol. 11, no. 6, 064503 (published 2024-11-01; Epub 2024-12-26). DOI: https://doi.org/10.1117/1.JMI.11.6.064503. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11670362/pdf/
Citations: 0
Abstract
Purpose: This study aimed to investigate the impact of different model retraining schemes and data partitioning on model performance in the task of COVID-19 classification on standard chest radiographs (CXRs), in the context of model generalizability.
Approach: Two datasets from the same institution were used: Set A (9860 patients, collected from 02/20/2020 to 02/03/2021) and Set B (5893 patients, collected from 03/15/2020 to 01/01/2022). An original deep learning (DL) model trained and tested in the task of COVID-19 classification using the initial partition of Set A achieved an area under the curve (AUC) value of 0.76, whereas Set B yielded a significantly lower value of 0.67. To explore this discrepancy, four separate strategies were applied to the original model: (1) retraining using Set B, (2) fine-tuning using Set B, (3) L2 regularization, and (4) repartitioning the training set from Set A 200 times and reporting the resulting AUC values.
Results: The model achieved the following AUC values (95% confidence interval) for the four methods: (1) 0.61 [0.56, 0.66]; (2) 0.70 [0.66, 0.73], both on Set B; (3) 0.76 [0.72, 0.79] on the initial test partition of Set A and 0.68 [0.66, 0.70] on Set B; and (4) 0.71 ± 0.013 on repartitions of Set A. The lowest AUC value (0.66 [0.62, 0.69]) of the Set A repartitions was no longer significantly different from the initial 0.67 achieved on Set B.
Conclusions: Different data repartitions of the same dataset used to train a DL model demonstrated significantly different performance values that helped explain the discrepancy between Set A and Set B and further demonstrated the limitations of model generalizability.
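The repartitioning experiment (strategy 4) can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the "model" is a toy mean-difference linear scorer standing in for the DL model, and the names `make_synthetic_cases`, `fit_mean_difference`, and `repartition_aucs`, the 20% test fraction, and the 30% prevalence are all assumptions made for illustration. The sketch only shows how one fixed dataset, split many times, yields a spread of test AUC values.

```python
import random
import statistics


def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic; tied scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic_cases(n, rng):
    """Hypothetical stand-in data: two noisy features and a binary label."""
    cases = []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0  # assumed 30% prevalence
        x = (rng.gauss(0.8 * y, 1.0), rng.gauss(0.4 * y, 1.0))
        cases.append((x, y))
    return cases


def fit_mean_difference(train):
    """Toy model: weights = positive-class mean minus negative-class mean."""
    def class_mean(label, dim):
        values = [x[dim] for x, y in train if y == label]
        return sum(values) / len(values)
    return tuple(class_mean(1, d) - class_mean(0, d) for d in range(2))


def repartition_aucs(cases, n_repartitions, test_fraction, seed):
    """Re-split the same cases repeatedly; refit and score on each split."""
    rng = random.Random(seed)
    n_test = int(len(cases) * test_fraction)
    results = []
    for _ in range(n_repartitions):
        shuffled = cases[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        w = fit_mean_difference(train)
        scores = [w[0] * x[0] + w[1] * x[1] for x, _ in test]
        labels = [y for _, y in test]
        results.append(auc(labels, scores))
    return results


cases = make_synthetic_cases(600, random.Random(0))
aucs = repartition_aucs(cases, n_repartitions=200, test_fraction=0.2, seed=1)
print(f"mean AUC = {statistics.mean(aucs):.3f} +/- {statistics.stdev(aucs):.3f}")
print(f"range    = [{min(aucs):.3f}, {max(aucs):.3f}]")
```

Even with the data and the fitting procedure held fixed, the minimum and maximum AUC across the 200 splits differ, which is the effect the abstract uses to put the Set A versus Set B gap in context.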
About the journal:
JMI covers fundamental and translational research, as well as applications, focused on medical imaging, which continue to yield physical and biomedical advancements in the early detection, diagnostics, and therapy of disease as well as in the understanding of normal. The scope of JMI includes: Imaging physics, Tomographic reconstruction algorithms (such as those in CT and MRI), Image processing and deep learning, Computer-aided diagnosis and quantitative image analysis, Visualization and modeling, Picture archiving and communications systems (PACS), Image perception and observer performance, Technology assessment, Ultrasonic imaging, Image-guided procedures, Digital pathology, Biomedical applications of biomedical imaging. JMI allows for the peer-reviewed communication and archiving of scientific developments, translational and clinical applications, reviews, and recommendations for the field.