Evaluating the robustness of deep learning models trained to diagnose idiopathic pulmonary fibrosis using a retrospective study

IF 3.2 · Zone 2, Medicine · Q1 Radiology, Nuclear Medicine & Medical Imaging

Medical Physics · Pub Date: 2025-03-20 · DOI: 10.1002/mp.17752

Wenxi Yu, Michael F. McNitt-Gray, Jonathan G Goldin, Jin Woo Song, Grace Hyun J. Kim

Medical Physics, volume 52, issue 6, pages 4239-4249.

Citations: 0

Abstract

Background

Deep learning (DL)-based systems have not yet been broadly implemented in clinical practice, in part due to unknown robustness across multiple imaging protocols.

Purpose

We aimed to evaluate the performance of several previously developed DL-based models, which were trained under standardized reference CT imaging protocols to distinguish idiopathic pulmonary fibrosis (IPF) from non-IPF among interstitial lung disease (ILD) patients. To assess model performance, we used CT scans from non-IPF ILD subjects acquired under various imaging protocols.

Methods

Three DL-based models, one 2D and two 3D, were previously developed to classify ILD patients as IPF or non-IPF based on chest CT scans. These models were trained on CT image data from 389 IPF and 700 non-IPF ILD patients, obtained retrospectively from five multicenter studies. For some patients, multiple CT scans were acquired (e.g., one at inhalation and one at exhalation) and/or reconstructed (e.g., thin slice and/or thick slice). For each patient, therefore, one CT image dataset was selected for constructing the classification model, and the parameters of that dataset serve as the reference conditions. In one non-IPF ILD study, due to its specific study protocol, many patients had multiple CT image datasets that were acquired in both prone and supine positions and/or reconstructed with different imaging parameters. To assess the robustness of the previously developed models under different (i.e., non-reference) imaging protocols, we identified 343 subjects from this study who had CT data from both the reference condition (used in model construction) and non-reference conditions (i.e., evaluation conditions), and used them in this model evaluation analysis. We report the specificities of the three models under the non-reference conditions. A generalized linear mixed-effects model (GLMM) was used to identify the CT technical and clinical parameters that were significantly associated with inconsistent diagnostic results between reference and evaluation conditions. The parameters considered were effective tube current-time product (known as "effective mAs"), reconstruction kernel, slice thickness, patient orientation (prone or supine), CT scanner model, and clinical diagnosis. Limitations include the retrospective nature of this study.
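Because every subject in the evaluation cohort is non-IPF, specificity reduces to the fraction of scans that a model labels non-IPF, and an "inconsistent" result is a subject whose label differs between the reference and evaluation scans. The following is a minimal sketch of those two quantities, not the authors' code; the `PairedPrediction` record, the `summarize` helper, and the four toy subjects are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedPrediction:
    """Hypothetical record: one non-IPF subject scored under both conditions."""
    subject_id: str
    reference_is_ipf: bool    # model output on the reference-condition scan
    evaluation_is_ipf: bool   # model output on a non-reference scan


def specificity(predicted_ipf):
    """Fraction of non-IPF subjects that the model labels non-IPF."""
    predicted_ipf = list(predicted_ipf)
    if not predicted_ipf:
        raise ValueError("need at least one prediction")
    true_negatives = sum(1 for is_ipf in predicted_ipf if not is_ipf)
    return true_negatives / len(predicted_ipf)


def summarize(pairs):
    """Specificity under each condition plus the rate of discordant labels."""
    pairs = list(pairs)
    discordant = sum(
        1 for p in pairs if p.reference_is_ipf != p.evaluation_is_ipf
    )
    return {
        "reference_specificity": specificity(p.reference_is_ipf for p in pairs),
        "evaluation_specificity": specificity(p.evaluation_is_ipf for p in pairs),
        "inconsistent_rate": discordant / len(pairs),
    }


# Toy data for illustration only; these are not results from the study.
pairs = [
    PairedPrediction("s01", False, False),
    PairedPrediction("s02", False, True),   # flips to a false positive
    PairedPrediction("s03", False, False),
    PairedPrediction("s04", True, True),    # false positive in both conditions
]
summary = summarize(pairs)
```

In the study, the per-subject inconsistency indicator is the kind of binary outcome a GLMM would model against the scan parameters; the sketch stops short of fitting one.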

Results

For all three DL models, overall specificity decreased under the non-reference conditions (p < 0.05 for two of the three models). The GLMM further suggests that, for at least one of the three models, mean effective mAs across the scan is the key factor that leads to the decrease in model predictive performance (p < 0.001). The difference in mean effective mAs between the reference and evaluation conditions (p = 0.03) and slice thickness (3 mm; p = 0.03) were flagged as significant factors for one of the three models; other factors were not statistically significant (p > 0.05).
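The abstract does not name the test behind the specificity comparison. As an illustration only, one common choice for paired binary outcomes on the same subjects is an exact McNemar test on the discordant pairs; the sketch below assumes that test, and the function name and the counts passed to it are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: subjects classified correctly at reference but wrongly at evaluation
    c: subjects classified wrongly at reference but correctly at evaluation
    Under the null hypothesis, each discordant pair falls in either cell
    with probability 1/2, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration; these are not figures from the study.
p_value = mcnemar_exact(8, 1)
```

Only the discordant pairs enter the calculation, which is why a paired design can detect a drop in specificity even when most subjects receive the same label under both conditions.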

Conclusion

Preliminary findings demonstrate a lack of robustness of the IPF diagnosis models when they are applied to CT series collected under different imaging protocols, indicating that care should be taken with the acquisition and reconstruction conditions used when developing DL models and deploying them in clinical practice.


Source journal: Medical Physics (Medicine: Nuclear Medicine)

CiteScore: 6.80

Self-citation rate: 15.80%

Articles published: 660

Review time: 1.7 months

About the journal: Medical Physics publishes original, high-impact physics, imaging science, and engineering research that advances patient diagnosis and therapy through contributions in (1) basic science developments with high potential for clinical translation, (2) clinical applications of cutting-edge engineering and physics innovations, and (3) broadly applicable and innovative clinical physics developments. Medical Physics is a journal of global scope and reach. By publishing in Medical Physics, your research will reach an international, multidisciplinary audience including practicing medical physicists as well as physics- and engineering-based translational scientists. We work closely with authors of promising articles to improve their quality.