Evaluating the robustness of deep learning models trained to diagnose idiopathic pulmonary fibrosis using a retrospective study.

Medical Physics. Pub Date: 2025-03-20. DOI: 10.1002/mp.17752
Wenxi Yu, Michael F McNitt-Gray, Jonathan G Goldin, Jin Woo Song, Grace Hyun J Kim

Abstract

Background: Deep learning (DL)-based systems have not yet been broadly implemented in clinical practice, in part due to unknown robustness across multiple imaging protocols.

Purpose: To this end, we evaluated the performance of several previously developed DL-based models that were trained, under standardized reference CT imaging protocols, to distinguish idiopathic pulmonary fibrosis (IPF) from non-IPF among interstitial lung disease (ILD) patients. To assess model performance, we used CT scans from non-IPF ILD subjects acquired under various imaging protocols.

Methods: Three DL-based models, one 2D and two 3D, were previously developed to classify ILD patients as IPF or non-IPF based on chest CT scans. These models were trained on CT image data from 389 IPF and 700 non-IPF ILD patients, obtained retrospectively from five multicenter studies. For some patients, multiple CT scans were acquired (e.g., one at inhalation and one at exhalation) and/or reconstructed (e.g., thin slice and/or thick slice). For each patient, one CT image dataset was therefore selected for use in constructing the classification model, and the parameters of that dataset serve as the reference conditions. In one non-IPF ILD study, because of its specific study protocol, many patients had multiple CT image datasets that were acquired in both prone and supine positions and/or reconstructed with different imaging parameters. To assess the robustness of the previously developed models under different (i.e., non-reference) imaging protocols, we identified 343 subjects from this study who had CT data from both the reference condition (used in model construction) and non-reference conditions (i.e., evaluation conditions), and we used these data in the model evaluation analysis. We report the specificities of the three models under the non-reference conditions. A generalized linear mixed-effects model (GLMM) was used to identify the CT technical and clinical parameters that were significantly associated with inconsistent diagnostic results between the reference and evaluation conditions. The selected parameters were effective tube current-time product (known as "effective mAs"), reconstruction kernel, slice thickness, patient orientation (prone or supine), CT scanner model, and clinical diagnosis. Limitations include the retrospective nature of this study.
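The two quantities described in Methods can be sketched as follows. This is a minimal illustration, not the authors' code: the `Series` record, its field names, and the toy data are all hypothetical. Because every subject in the evaluation cohort is non-IPF, each IPF prediction is a false positive, so specificity is simply the fraction of series predicted as non-IPF. The second function builds the binary "inconsistent result" indicator that a GLMM could take as its outcome.

```python
from dataclasses import dataclass


@dataclass
class Series:
    """One CT series from a non-IPF subject (hypothetical record layout)."""
    subject_id: str
    condition: str       # "reference" or "evaluation" (assumed labels)
    predicted_ipf: bool  # model output for this series


def specificity(series):
    """TN / (TN + FP) for a cohort in which every subject is truly non-IPF."""
    if not series:
        raise ValueError("specificity is undefined for an empty cohort")
    true_negatives = sum(1 for s in series if not s.predicted_ipf)
    return true_negatives / len(series)


def inconsistent_flags(series):
    """Return (subject_id, flag) per evaluation series; flag is 1 when the
    prediction differs from the same subject's reference prediction."""
    reference = {
        s.subject_id: s.predicted_ipf
        for s in series
        if s.condition == "reference"
    }
    return [
        (s.subject_id, int(s.predicted_ipf != reference[s.subject_id]))
        for s in series
        if s.condition == "evaluation" and s.subject_id in reference
    ]


# Toy data, invented purely for illustration.
toy = [
    Series("A", "reference", False), Series("A", "evaluation", False),
    Series("B", "reference", False), Series("B", "evaluation", True),
    Series("C", "reference", False), Series("C", "evaluation", False),
    Series("C", "evaluation", True),
]
reference_series = [s for s in toy if s.condition == "reference"]
evaluation_series = [s for s in toy if s.condition == "evaluation"]

print(specificity(reference_series))
print(specificity(evaluation_series))
print(inconsistent_flags(toy))
```

A subject may contribute several evaluation series (as subject "C" does here), which is one reason a mixed-effects model with a per-subject term suits this kind of data.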

Results: For all three DL models, overall specificity decreased under the non-reference conditions (p < 0.05 for two of the three models). The GLMM further suggested that, for at least one of the three models, mean effective mAs across the scan was the key factor that leads to the decrease in predictive performance (p < 0.001). The difference in mean effective mAs between the reference and evaluation conditions (p = 0.03) and slice thickness (3 mm; p = 0.03) were significant factors for one of the three models. Other factors were not statistically significant (p > 0.05).
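The abstract does not give the exact GLMM specification. As an illustration only, one common form for a binary outcome with repeated series per subject is a logistic model with a subject-level random intercept. The symbols below are assumptions, not the authors' notation.

```latex
\operatorname{logit}\,\Pr(Y_{ij} = 1 \mid u_i)
  = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

Here $Y_{ij} = 1$ if the prediction for evaluation series $j$ of subject $i$ differs from that subject's reference prediction, $\mathbf{x}_{ij}$ collects the covariates listed in Methods (effective mAs, reconstruction kernel, slice thickness, patient orientation, scanner model, clinical diagnosis), and $u_i$ is the random intercept for subject $i$.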

Conclusion: These preliminary findings demonstrate a lack of robustness of the IPF diagnosis models when they are applied to CT series collected under different imaging protocols, indicating that care should be taken with the acquisition and reconstruction conditions used when developing DL models and deploying them in clinical practice.
